compareRecords {BRL} | R Documentation |
Creation of Comparison Data
Description
Create comparison vectors for all pairs of records coming from two datafiles to be linked.
Usage
compareRecords(
  df1,
  df2,
  flds = NULL,
  flds1 = NULL,
  flds2 = NULL,
  types = NULL,
  breaks = c(0, 0.25, 0.5)
)
Arguments
df1, df2 |
two datasets to be linked, of class |
flds |
a vector indicating the fields to be used in the linkage. Either a |
flds1, flds2 |
vectors indicating the fields of |
types |
a vector of characters indicating the comparison type per comparison field. The options are: |
breaks |
break points for the comparisons to obtain levels of disagreement. It can be a list of length equal to the number of comparison fields, containing one numeric vector with the break points for each comparison field, where entries corresponding to comparison type |
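As a rough illustration of how break points can turn a distance into levels of disagreement, the base R sketch below bins some made-up distances with the default `breaks`. It does not call BRL. The interval convention (level 1 for a distance of exactly 0, then right-closed intervals) and the count of two levels for a binary comparison are assumptions made for this sketch, not something the text above states.

```r
# Default break points from the Usage section
breaks <- c(0, 0.25, 0.5)

# Hypothetical normalized distances in [0, 1], invented for illustration
d <- c(0, 0.1, 0.25, 0.4, 0.9)

# Assumed convention: level 1 is d == 0, then (0, 0.25], (0.25, 0.5], (0.5, Inf)
lev <- cut(d, breaks = c(-Inf, breaks, Inf), labels = FALSE)

# Three break points give four levels of disagreement
n_levels_default <- length(breaks) + 1L

# List form: one vector of break points per comparison field.
# The NULL entry stands for a field that needs no break points
# (assumed here to be a binary comparison with two levels).
breaks_list <- list(c(0, 0.25, 0.5), c(0, 0.2, 0.4, 0.6), NULL, 0:3)
n_levels <- vapply(
  breaks_list,
  function(b) if (is.null(b)) 2L else length(b) + 1L,
  integer(1)
)

print(lev)
print(n_levels)
```

Under these assumptions, a field with L break points ends up with L+1 levels, which matches the L+1 indicator columns described in the Value section.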
Value
a list containing:
comparisons |
matrix with n1*n2 rows, where the comparison pattern for record pair (i,j) appears in row (j-1)*n1+i, for i in {1,...,n1}, and j in {1,...,n2}. A comparison field with L+1 levels of disagreement is represented by L+1 columns of TRUE/FALSE indicators. Missing comparisons are coded as FALSE, which is justified under an assumption of ignorability of the missing comparisons; see Sadinle (2017).
n1, n2 |
the datafile sizes, n1 = nrow(df1) and n2 = nrow(df2).
nDisagLevs |
a vector containing the number of levels of disagreement per comparison field.
compFields |
a data frame containing the names of the fields in the datafiles used in the comparisons and the types of comparison.
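The row layout and the indicator coding described above can be sketched in base R without calling BRL. The file sizes, the helper names `pair_row` and `row_pair`, and the observed levels below are hypothetical and serve only to illustrate the indexing rule (j-1)*n1+i and the TRUE/FALSE coding with missing comparisons set to FALSE. Which column corresponds to which level is an assumption of this sketch.

```r
# Hypothetical file sizes, for illustration only
n1 <- 3
n2 <- 2

# Enumerate pairs with i varying fastest, which matches row = (j - 1) * n1 + i
pairs <- expand.grid(i = seq_len(n1), j = seq_len(n2))
pairs$row <- (pairs$j - 1) * n1 + pairs$i

# Hypothetical helper: row index of record pair (i, j)
pair_row <- function(i, j, n1) (j - 1) * n1 + i

# Hypothetical helper: recover (i, j) from a row index
row_pair <- function(r, n1) c(i = (r - 1) %% n1 + 1, j = (r - 1) %/% n1 + 1)

# Indicator coding of one field with L + 1 = 4 levels of disagreement.
# NA marks a missing comparison.
levels_obs <- c(1, 3, NA, 4)
ind <- outer(levels_obs, seq_len(4), "==")
ind[is.na(ind)] <- FALSE  # missing comparisons become an all-FALSE row

print(pairs)
print(ind)
```

With this layout, all pairs involving the first record of the second datafile occupy rows 1 to n1, the next n1 rows involve its second record, and so on.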
References
Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association 112(518), 600-612.
Examples
library(BRL)
data(twoFiles)

myCompData <- compareRecords(df1, df2,
                             flds = c("gname", "fname", "age", "occup"),
                             types = c("lv", "lv", "bi", "bi"),
                             breaks = c(0, .25, .5))

## same as
myCompData <- compareRecords(df1, df2, types = c("lv", "lv", "bi", "bi"))

## let's transform 'occup' to numeric to illustrate how to obtain numeric comparisons
df1$occup <- as.numeric(df1$occup)
df2$occup <- as.numeric(df2$occup)

## using different break points for 'lv' and 'nu' comparisons
myCompData1 <- compareRecords(df1, df2,
                              flds = c("gname", "fname", "age", "occup"),
                              types = c("lv", "lv", "bi", "nu"),
                              breaks = list(lv = c(0, .25, .5), nu = 0:3))

## using different break points for each comparison field
myCompData2 <- compareRecords(df1, df2,
                              flds = c("gname", "fname", "age", "occup"),
                              types = c("lv", "lv", "bi", "nu"),
                              breaks = list(c(0, .25, .5), c(0, .2, .4, .6), NULL, 0:3))
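A possible way to inspect the returned list is sketched below. It relies only on the components named in the Value section (`comparisons`, `n1`, `n2`, `nDisagLevs`, `compFields`) and on the `twoFiles` data used in the examples. The object name `cd` and the chosen pair (i, j) = (2, 3) are arbitrary, and the block is skipped when the BRL package is not installed.

```r
# Sketch only: runs when the BRL package is available, otherwise does nothing
if (requireNamespace("BRL", quietly = TRUE)) {
  library(BRL)
  data(twoFiles)  # provides df1 and df2, as in the examples above
  cd <- compareRecords(df1, df2, types = c("lv", "lv", "bi", "bi"))

  # One row per record pair, as described in the Value section
  stopifnot(nrow(cd$comparisons) == cd$n1 * cd$n2)

  # Levels of disagreement per field, and the number of indicator columns
  print(cd$nDisagLevs)
  print(ncol(cd$comparisons))

  # Comparison pattern for record pair (i, j), stored in row (j - 1) * n1 + i
  i <- 2
  j <- 3
  print(cd$comparisons[(j - 1) * cd$n1 + i, ])

  # Fields and comparison types actually used
  print(cd$compFields)
}
```

Since each field with L+1 levels is represented by L+1 columns, the printed column count would be expected to equal the sum of `nDisagLevs`.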