rl_reg1 {representr} | R Documentation
500 records suitable for record linkage with additional regression variables
Description
Simulated datasets containing the name, birthdate, and additional attributes of 500 records, which together represent 350 unique individuals.
Usage
rl_reg1
rl_reg2
rl_reg5
identity.rl_reg1
identity.rl_reg2
identity.rl_reg5
linkage.rl
Format
rl_reg1 and rl_reg5 are data frames with 500 rows and 9 columns. Each row represents one record, with the following columns:
- fname
First name
- lname
Last name
- bm
Birth month (numeric)
- bd
Birth day
- by
Birth year
- sex
Sex ("M" or "F")
- education
Education level ("Less than a high school diploma", "High school graduates, no college", "Some college or associate degree", "Bachelor's degree only", or "Advanced degree")
- income
Yearly income (in 1000s)
- bp
Systolic blood pressure
identity.rl_reg1 and identity.rl_reg5 are integer vectors indicating the true record ids of the two datasets. Two records represent the same individual if and only if their corresponding identity values are equal.
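The equality rule above can be illustrated with a small sketch. The vector below is a hypothetical toy example, not values taken from the packaged data; with the real data, the same calls would be applied to identity.rl_reg1 or identity.rl_reg5.

```r
# Toy identity vector (hypothetical values, NOT from the package data).
# Records 1 and 5 share an identity, as do records 2 and 3.
identity <- c(1L, 2L, 2L, 3L, 1L, 4L)

# Number of unique individuals represented by the records
n_individuals <- length(unique(identity))

# Records that share an identity with at least one other record
in_cluster <- identity %in% identity[duplicated(identity)]

# Pairwise match matrix: TRUE when records i and j are the same individual
same_individual <- outer(identity, identity, "==")
```

Applied to the packaged identity vectors, `length(unique(...))` would be expected to return 350, the number of unique individuals stated in the Description.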
linkage.rl contains the result of running 100,000 iterations of a record linkage model using the package dblinkR.
An object of class data.frame with 500 rows and 9 columns.
An object of class data.frame with 500 rows and 9 columns.
An object of class integer of length 500.
An object of class integer of length 500.
An object of class integer of length 500.
An object of class matrix (inherits from array) with 100000 rows and 500 columns.
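As a rough sketch of how a matrix shaped like linkage.rl (one row per iteration, one column per record) might be summarized, the example below uses a small hypothetical matrix. It assumes that each entry is the cluster label assigned to a record in a given iteration; that interpretation is an assumption for illustration and is not stated on this page.

```r
# Toy stand-in for linkage.rl: 4 iterations (rows) by 3 records (columns).
# ASSUMPTION: each entry is the cluster label of that record in that iteration.
lambda <- matrix(c(1L, 1L, 2L,
                   1L, 1L, 2L,
                   1L, 2L, 2L,
                   1L, 2L, 3L),
                 nrow = 4, byrow = TRUE)

# Share of iterations in which records 1 and 2 carry the same label
p_12 <- mean(lambda[, 1] == lambda[, 2])

# Number of distinct labels (implied individuals) in each iteration
n_clusters <- apply(lambda, 1, function(draw) length(unique(draw)))
```

Under the same assumption, replacing `lambda` with linkage.rl would give the estimated probability that two given records refer to the same individual.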
Details
There is a known relationship between three of the variables in the dataset: blood pressure (bp), income, and sex.
$$\mathrm{bp} = 160 + 10\, I(\mathrm{sex} = \text{"M"}) - \mathrm{income} + 0.5\, \mathrm{income} \cdot I(\mathrm{sex} = \text{"M"}) + \epsilon$$
where $\epsilon \sim \mathrm{Normal}(0, \sigma^2)$ and $\sigma = 1, 2, 5$.
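To make the relationship concrete, the following sketch simulates data from the stated model and recovers the coefficients with a linear model that includes an interaction term. The sample size, the income range, and the choice of sigma = 1 are assumptions made for illustration; the simulated values are not the packaged data.

```r
set.seed(1)

# Hypothetical simulation settings (assumptions, not taken from the package)
n      <- 350
sigma  <- 1                                  # one of the documented values 1, 2, 5
sex    <- sample(c("M", "F"), n, replace = TRUE)
income <- runif(n, min = 20, max = 100)      # yearly income in 1000s, assumed range

# Indicator I(sex = "M")
male <- as.numeric(sex == "M")

# bp = 160 + 10 I(sex = "M") - income + 0.5 income * I(sex = "M") + epsilon
bp <- 160 + 10 * male - income + 0.5 * income * male +
  rnorm(n, mean = 0, sd = sigma)

# male * income expands to male + income + male:income
fit <- lm(bp ~ male * income)
est <- unname(coef(fit))  # order: intercept, male, income, male:income
```

The estimates in `est` should land close to 160, 10, -1, and 0.5, with more scatter when the larger values of sigma are used.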
The 150 duplicated records have randomly generated errors.
Source
Names and birthdates generated with the ANU Online Personal Data Generator and Corruptor (GeCO) version 0.1 https://dmm.anu.edu.au/geco/.