compareRecords {BRL}    R Documentation
Create comparison vectors for all pairs of records coming from two datafiles to be linked.
compareRecords(
  df1,
  df2,
  flds = NULL,
  flds1 = NULL,
  flds2 = NULL,
  types = NULL,
  breaks = c(0, 0.25, 0.5)
)
df1 , df2 |
two datasets to be linked, of class |
flds |
a vector indicating the fields to be used in the linkage. Either a |
flds1 , flds2 |
vectors indicating the fields of |
types |
a vector of characters indicating the comparison type per comparison field. The options
are: |
breaks |
break points for the comparisons to obtain levels of disagreement.
It can be a list of length equal to the number of comparison fields, containing one numeric vector with the break
points for each comparison field, where entries corresponding to comparison type |
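As a rough illustration of how break points turn a distance into levels of disagreement, the base-R sketch below bins a normalized edit distance with `cut()`. It is not the package's implementation: the helper names `norm_lev` and `disag_level`, the normalization by the longer string's length, and the right-closed intervals are all assumptions made for this example.

```r
# Hypothetical helpers for illustration; these are not BRL functions.

# Levenshtein distance between paired strings, scaled to [0, 1] by the
# length of the longer string (an assumed normalization).
norm_lev <- function(a, b) {
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  d / pmax(nchar(a), nchar(b))
}

# Map distances to ordered levels of disagreement. L break points give
# L + 1 levels; intervals are right-closed (assumed), so level 1 means
# dist <= breaks[1].
disag_level <- function(dist, breaks = c(0, 0.25, 0.5)) {
  cut(dist, breaks = c(-Inf, breaks, Inf), labels = FALSE)
}

d <- norm_lev(c("maria", "kitten", "john"), c("maria", "sitting", "jane"))
lev <- disag_level(d)   # one integer level per pair, between 1 and 4
```

Under these assumptions, the default `breaks = c(0, 0.25, 0.5)` yields four levels: exact agreement, then the intervals (0, 0.25], (0.25, 0.5], and everything above 0.5.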
a list containing:
comparisons
matrix with n1*n2 rows, where the comparison pattern for record pair (i,j) appears in row (j-1)*n1+i, for i in {1,...,n1} and j in {1,...,n2}.
A comparison field with L+1 levels of disagreement is represented by L+1 columns of TRUE/FALSE indicators. Missing comparisons are coded as FALSE, which is justified under an assumption of ignorability of the missing comparisons; see Sadinle (2017).
n1, n2
the datafile sizes, n1 = nrow(df1) and n2 = nrow(df2).
nDisagLevs
a vector containing the number of levels of disagreement per comparison field.
compFields
a data frame containing the names of the fields in the datafiles used in the comparisons and the types of comparison.
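The layout of the comparisons matrix can be sketched in base R as follows. The helper names `pair_to_row`, `row_to_pair`, and `levels_to_indicators` are hypothetical and only mirror the description above (row (j-1)*n1+i for pair (i,j), one TRUE/FALSE column per level, missing comparisons coded as FALSE); they are not taken from the package source.

```r
# Hypothetical helpers illustrating the documented layout; not BRL code.

# Row of the comparison matrix that holds record pair (i, j).
pair_to_row <- function(i, j, n1) (j - 1) * n1 + i

# Inverse mapping: recover (i, j) from a row index.
row_to_pair <- function(row, n1) {
  c(i = (row - 1) %% n1 + 1, j = (row - 1) %/% n1 + 1)
}

# Expand integer disagreement levels (1..nLevels, NA for a missing
# comparison) into nLevels TRUE/FALSE columns. A missing comparison
# becomes a row of all FALSE.
levels_to_indicators <- function(lev, nLevels) {
  ind <- outer(lev, seq_len(nLevels), "==")
  ind[is.na(ind)] <- FALSE
  ind
}

n1 <- 3
pair_to_row(2, 2, n1)                            # pair (2, 2) sits in row 5
row_to_pair(5, n1)                               # and row 5 maps back to (2, 2)
levels_to_indicators(c(1, 3, NA), nLevels = 4)   # 3 x 4 logical matrix
```

This ordering, with i varying fastest, is the same one `expand.grid(i = 1:n1, j = 1:n2)` produces, which can be convenient when attaching record indices to the rows.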
Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association 112(518), 600-612.
library(BRL)
data(twoFiles)
myCompData <- compareRecords(df1, df2,
                             flds=c("gname", "fname", "age", "occup"),
                             types=c("lv","lv","bi","bi"),
                             breaks=c(0,.25,.5))
## same as
myCompData <- compareRecords(df1, df2, types=c("lv","lv","bi","bi"))
## let's transform 'occup' to numeric to illustrate how to obtain numeric comparisons
df1$occup <- as.numeric(df1$occup)
df2$occup <- as.numeric(df2$occup)
## using different break points for 'lv' and 'nu' comparisons
myCompData1 <- compareRecords(df1, df2,
                              flds=c("gname", "fname", "age", "occup"),
                              types=c("lv","lv","bi","nu"),
                              breaks=list(lv=c(0,.25,.5), nu=0:3))
## using different break points for each comparison field
myCompData2 <- compareRecords(df1, df2,
                              flds=c("gname", "fname", "age", "occup"),
                              types=c("lv","lv","bi","nu"),
                              breaks=list(c(0,.25,.5), c(0,.2,.4,.6), NULL, 0:3))
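One way to sanity-check the returned list is to compare its dimensions against the documented components. The sketch below uses a hypothetical helper, `check_comp_data`, and assumes the list elements are named `comparisons`, `n1`, `n2`, and `nDisagLevs` as described in the value section. It is first run on a stand-in object built by hand, and then on real output only if BRL happens to be installed.

```r
# Hypothetical checker; relies only on the components documented above.
check_comp_data <- function(cd) {
  stopifnot(
    # one row per record pair
    nrow(cd$comparisons) == cd$n1 * cd$n2,
    # one indicator column per level of disagreement, summed over fields
    ncol(cd$comparisons) == sum(cd$nDisagLevs)
  )
  invisible(TRUE)
}

# Stand-in object with the documented shape: 3 x 2 record pairs and two
# comparison fields with 4 and 2 levels of disagreement.
fake <- list(
  comparisons = matrix(FALSE, nrow = 6, ncol = 6),
  n1 = 3, n2 = 2,
  nDisagLevs = c(4, 2)
)
ok <- check_comp_data(fake)

# With BRL installed, the same check can be applied to real output.
if (requireNamespace("BRL", quietly = TRUE)) {
  data(twoFiles, package = "BRL")
  cd <- BRL::compareRecords(df1, df2, types = c("lv", "lv", "bi", "bi"))
  check_comp_data(cd)
}
```

A failed check stops with an error naming the violated condition, which makes a mismatch between the number of break points and the number of indicator columns easy to spot.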