RLdata {RecordLinkage} | R Documentation
Test data for Record Linkage
Description
The RLdata tables contain artificial personal data for the evaluation of
Record Linkage procedures. Some records have been duplicated with randomly
generated errors. RLdata500 contains 50 duplicates and RLdata10000 contains
1000 duplicates.
Usage
RLdata500
RLdata10000
identity.RLdata500
identity.RLdata10000
Format
RLdata500 and RLdata10000 are character matrices with 500 and 10000 records,
respectively. Each row represents one record, with the following columns:
- fname_c1
First name, first component
- fname_c2
First name, second component
- lname_c1
Last name, first component
- lname_c2
Last name, second component
- by
Year of birth
- bm
Month of birth
- bd
Day of birth
identity.RLdata500 and identity.RLdata10000 are integer vectors representing
the true record ids of the two data sets. Two records are duplicates if and
only if their corresponding values in the identity vector agree.
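The duplicate criterion can be illustrated with a minimal sketch in base R. The vector toy_identity and the helper duplicate_pairs() are hypothetical names introduced here for illustration; they are not part of the RecordLinkage package, and the values are made up rather than taken from the data sets.

```r
# Toy identity vector (hypothetical values, not taken from the package data):
# records 1 and 6 share id 1, records 2 and 4 share id 2.
toy_identity <- c(1L, 2L, 3L, 2L, 4L, 1L)

# Hypothetical helper: return all unordered pairs (i < j) of record
# indices whose identity values agree, i.e. the true duplicate pairs.
duplicate_pairs <- function(identity) {
  same <- outer(identity, identity, "==")       # TRUE where ids agree
  same[lower.tri(same, diag = TRUE)] <- FALSE   # keep only pairs with i < j
  pairs <- which(same, arr.ind = TRUE)
  colnames(pairs) <- c("record1", "record2")
  pairs[order(pairs[, 1], pairs[, 2]), , drop = FALSE]
}

pairs <- duplicate_pairs(toy_identity)
print(pairs)
```

With the package installed, the same helper could presumably be applied to identity.RLdata500 after data(RLdata500). Because outer() builds an n-by-n matrix, this sketch is only practical for small vectors; for identity.RLdata10000 a grouping approach such as split(seq_along(identity), identity) would likely be preferable.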
Author(s)
Andreas Borg, Murat Sariyar
Source
Generated with the data generation component of Febrl (Freely Extensible Biomedical Record Linkage), version 0.3 (https://sourceforge.net/projects/febrl/). The following data sources were used (all relate to Germany):
https://blog.beliebte-vornamen.de/2009/02/prozentuale-anteile-2008/, a list of the frequencies of the 20 most popular female names in 2008.
https://www.beliebte-vornamen.de/760-alle_jahre.htm, a list of the 100 most popular first names since 1890. The frequencies found in the source above were extrapolated to fit this list.
http://www.ahnenforschung-in-stormarn.de/geneal/nachnamen_100.htm, a list of the 100 most frequent family names with frequencies.
Age distribution as of Dec 31st, 2008, statistics of Statistisches Bundesamt Deutschland, taken from the GENESIS database (https://www-genesis.destatis.de/genesis/online/logon).
Web links as of August 2020.