twoFiles {BRL} | R Documentation
Two Datasets for Record Linkage
Description
Two data frames, df1 and df2, containing 300 and 150 records of artificially created individuals, respectively, where 50 individuals are included in both datafiles. In addition, the vector df2ID contains one entry per record in df2 indicating the true matching between the datafiles, codified as follows: a number smaller than or equal to n1=300 in entry j indicates the record in df1 to which record j in df2 truly matches, and a number n1+j indicates that record j in df2 does not match any record in df1.
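The encoding can be illustrated on a toy vector. The following is a minimal sketch, not part of the package: the values of n1 and df2ID_toy are made up for illustration, and the names df2ID_toy, is_match, matched_pairs and unmatched_file2 are hypothetical.

```r
## Toy illustration of the df2ID encoding (made-up values, not from twoFiles)
n1 <- 5                       # assumed number of records in the first file
df2ID_toy <- c(3, 7, 1, 9)    # one entry per record in the second file

## An entry <= n1 in position j: record j of file 2 matches that record of file 1.
## An entry equal to n1 + j: record j of file 2 has no match in file 1.
is_match <- df2ID_toy <= n1

## Matched pairs as (record in file 1, record in file 2)
matched_pairs <- data.frame(file1 = df2ID_toy[is_match],
                            file2 = which(is_match))

## Records of file 2 without a match in file 1
unmatched_file2 <- which(!is_match)

## Consistency check: every non-match is coded as n1 + j
codes_ok <- all(df2ID_toy[!is_match] == n1 + unmatched_file2)

print(matched_pairs)
print(unmatched_file2)
print(codes_ok)
```

Here records 1 and 3 of the second file match records 3 and 1 of the first file, while records 2 and 4 are coded as n1+2=7 and n1+4=9 and so have no match.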
Usage
data(twoFiles)
Source
Extracted from the datafiles used in the simulation studies of Sadinle (2017). The datafiles were originally generated using code provided by Peter Christen (https://users.cecs.anu.edu.au/~Peter.Christen/).
References
Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association 112(518), 600-612.
Examples
data(twoFiles)
n1 <- nrow(df1)
## the true matches
cbind( df1[df2ID[df2ID<=n1],], df2[df2ID<=n1,] )
## alternatively
df1$ID <- 1:n1
df2$ID <- df2ID
merge(df1, df2, by="ID")
## all the records in a merged file
merge(df1, df2, by="ID", all=TRUE)
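As a further sanity check, the sizes and the encoding stated in the Description can be inspected directly. This is a sketch that assumes the BRL package is installed; it is guarded so that it does nothing otherwise, and the helper names n2 and is_match are introduced here for illustration only.

```r
## Inspect the dimensions and the df2ID encoding (runs only if BRL is installed)
if (requireNamespace("BRL", quietly = TRUE)) {
  data(twoFiles, package = "BRL")   # reloads df1, df2 and df2ID

  n1 <- nrow(df1)
  n2 <- nrow(df2)
  is_match <- df2ID <= n1           # TRUE where record j of df2 has a match in df1

  cat("records in df1:", n1, "\n")
  cat("records in df2:", n2, "\n")
  cat("true matches:  ", sum(is_match), "\n")

  ## df2ID should have one entry per record in df2
  print(length(df2ID) == n2)

  ## non-matching records should be coded as n1 + j
  print(all(df2ID[!is_match] == n1 + which(!is_match)))

  ## in a bipartite matching no record of df1 is matched more than once
  print(anyDuplicated(df2ID[is_match]) == 0)
}
```

If the data agree with the Description, the counts printed should be 300, 150 and 50, and the three checks should each print TRUE.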