twoFiles {BRL}    R Documentation

Two Datasets for Record Linkage


Two data frames, df1 and df2, containing 300 and 150 records of artificially created individuals, respectively; 50 of these individuals appear in both datafiles. In addition, the vector df2ID contains one entry per record in df2 indicating the true matching between the datafiles, coded as follows: a number less than or equal to n1 = 300 in entry j gives the record in df1 that record j in df2 truly matches, and the number n1 + j indicates that record j in df2 does not match any record in df1.
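The coding of df2ID can be illustrated on a small made-up vector. The sketch below does not use the package data: the values of n1 and toyID, and the names is_match and match_in_file1, are hypothetical and chosen only to follow the rule described above.

```r
# Toy illustration of the df2ID coding (hypothetical values, not the package data)
n1 <- 5                      # number of records in the first file
toyID <- c(2, 7, 5, 9)       # one entry per record in the second file (n2 = 4)

n2 <- length(toyID)
is_match <- toyID <= n1      # TRUE where record j has a true match in file 1

# row of file 1 that record j matches, NA when record j has no match
match_in_file1 <- ifelse(is_match, toyID, NA)

# under the coding, every non-match carries the value n1 + j
stopifnot(all(toyID[!is_match] == n1 + which(!is_match)))

data.frame(j = seq_len(n2), code = toyID, is_match, match_in_file1)
```

Because non-matching records receive the distinct codes n1 + j, every record in the second file ends up with an identifier that never collides with a row index of the first file.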




Extracted from the datafiles used in the simulation studies of Sadinle (2017). The datafiles were originally generated using code provided by Peter Christen (


Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association 112(518), 600-612.



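library(BRL)
data(twoFiles)  # loads df1, df2 and df2ID

n1 <- nrow(df1)

## the true matches: rows of df1 side by side with the rows of df2 they match
isMatch <- df2ID <= n1
cbind(df1[df2ID[isMatch], ], df2[isMatch, ])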

## alternatively
df1$ID <- 1:n1
df2$ID <- df2ID
merge(df1, df2, by="ID")

## all the records in a merged file
merge(df1, df2, by="ID", all=TRUE)
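As a rough check on what the two merges return, the same steps can be run on a pair of tiny made-up data frames. The objects toy1, toy2 and toyID below are hypothetical stand-ins for df1, df2 and df2ID. The sketch assumes the ID coding described above, under which the inner merge should keep one row per true match and the full merge one row per distinct individual.

```r
# Toy data frames (hypothetical, standing in for df1 and df2)
toy1 <- data.frame(name = c("ann", "bob", "cy"))
toy2 <- data.frame(name = c("bob", "dee"))
n1 <- nrow(toy1)

# record 1 of toy2 matches record 2 of toy1; record 2 is a non-match, coded n1 + j
toyID <- c(2, n1 + 2)

toy1$ID <- 1:n1
toy2$ID <- toyID

inner <- merge(toy1, toy2, by = "ID")              # matched pairs only
full  <- merge(toy1, toy2, by = "ID", all = TRUE)  # every record from both files

n_matches <- sum(toyID <= n1)
nrow(inner) == n_matches                       # one row per true match
nrow(full) == n1 + nrow(toy2) - n_matches      # one row per distinct individual
```

By the same arithmetic, the full merge of the package data would be expected to have 300 + 150 - 50 = 400 rows, provided df2ID follows the coding exactly.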

[Package BRL version 0.1.0 Index]