R: Test data for Record Linkage

RLdata {RecordLinkage}

R Documentation

Test data for Record Linkage

Description

The RLdata tables contain artificial personal data for the evaluation of Record Linkage procedures. Some records have been duplicated with randomly generated errors. RLdata500 contains 50 duplicates, and RLdata10000 contains 1000 duplicates.

Usage

RLdata500 
RLdata10000
identity.RLdata500 
identity.RLdata10000

Format

RLdata500 and RLdata10000 are character matrices with 500 and 10000 records, respectively. Each row represents one record, with the following columns:

fname_c1: First name, first component
fname_c2: First name, second component
lname_c1: Last name, first component
lname_c2: Last name, second component
by: Year of birth
bm: Month of birth
bd: Day of birth
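The column layout above can be illustrated with a small stand-in matrix. This is only a sketch: the object names `toy` and `birth_date` and all values are hypothetical, and it assumes that a missing name component is stored as NA.

```r
# Hypothetical two-record matrix with the same column layout as RLdata500
# (values invented for illustration; not taken from the data set).
cols <- c("fname_c1", "fname_c2", "lname_c1", "lname_c2", "by", "bm", "bd")
toy <- matrix(
  c("ANNA", "MARIA", "MUELLER", NA, "1970", "3", "14",
    "ANNA", NA,      "MUELLER", NA, "1970", "3", "14"),
  ncol = length(cols), byrow = TRUE,
  dimnames = list(NULL, cols)
)

# Columns are addressed by name; here the three date columns are
# combined into one ISO-style birth date per record.
birth_date <- sprintf("%04d-%02d-%02d",
                      as.integer(toy[, "by"]),
                      as.integer(toy[, "bm"]),
                      as.integer(toy[, "bd"]))
print(birth_date)
```

The same column names should work on the real tables once the RecordLinkage package is loaded.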

identity.RLdata500 and identity.RLdata10000 are integer vectors representing the true record ids of the two data sets. Two records are duplicates if and only if their corresponding values in the identity vector agree.
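The duplicate rule can be sketched on a toy identity vector. The vector `identity` below is a hypothetical stand-in with invented values, not the real data; `pairs`, `dup_pairs`, and `n_dup` are illustrative names.

```r
# Toy stand-in for identity.RLdata500 (hypothetical values).
identity <- c(1L, 2L, 3L, 2L, 4L, 1L)

# All unordered pairs of record indices, one pair per row.
pairs <- t(combn(length(identity), 2))

# A pair is a true duplicate exactly when its identity values agree.
is_dup <- identity[pairs[, 1]] == identity[pairs[, 2]]
dup_pairs <- pairs[is_dup, , drop = FALSE]
print(dup_pairs)   # records (1, 6) and (2, 4) share an id

# Number of records whose id repeats an earlier record's id.
n_dup <- sum(duplicated(identity))
print(n_dup)
```

With the package installed, replacing the toy vector with identity.RLdata500 should apply the same logic to the real data. Enumerating all pairs grows quadratically with the number of records, so for RLdata10000 the `duplicated()` count is likely the more practical check.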

Author(s)

Andreas Borg, Murat Sariyar

Source

Generated with the data generation component of Febrl (Freely Extensible Biomedical Record Linkage), version 0.3 (https://sourceforge.net/projects/febrl/). The following data sources were used (all relate to Germany):

https://blog.beliebte-vornamen.de/2009/02/prozentuale-anteile-2008/, a list of the frequencies of the 20 most popular female names in 2008.

https://www.beliebte-vornamen.de/760-alle_jahre.htm, a list of the 100 most popular first names since 1890. The frequencies found in the source above were extrapolated to fit this list.

http://www.ahnenforschung-in-stormarn.de/geneal/nachnamen_100.htm, a list of the 100 most frequent family names with frequencies.

Age distribution as of Dec 31st, 2008, statistics of Statistisches Bundesamt Deutschland, taken from the GENESIS database (https://www-genesis.destatis.de/genesis/online/logon).

Web links as of August 2020.

[Package RecordLinkage version 0.4-12.4 Index]