R: Creation of Comparison Data

compareRecords {BRL}

R Documentation

Creation of Comparison Data

Description

Create comparison vectors for all pairs of records coming from two datafiles to be linked.

Usage

compareRecords(
  df1,
  df2,
  flds = NULL,
  flds1 = NULL,
  flds2 = NULL,
  types = NULL,
  breaks = c(0, 0.25, 0.5)
)

Arguments

`df1`, `df2`	two datasets to be linked, of class `data.frame`, with rows representing records and columns representing fields. Without loss of generality, `df1` is assumed to have no fewer records than `df2`.
`flds`	a vector indicating the fields to be used in the linkage. Either a `character` vector, in which case all entries need to be names of columns of `df1` and `df2`, or a `numeric` vector indicating the columns in `df1` and `df2` to be used in the linkage. If provided as a `numeric` vector it is assumed that the columns of `df1` and `df2` are organized such that it makes sense to compare the columns `df1[,flds]` and `df2[,flds]` in that order.
`flds1`, `flds2`	vectors indicating the fields of `df1` and `df2` to be used in the linkage. Either `character` vectors, in which case all entries need to be names of columns of `df1` and `df2`, respectively, or `numeric` vectors indicating the columns in `df1` and `df2` to be used in the linkage. It is assumed that it makes sense to compare the columns `df1[,flds1]` and `df2[,flds2]` in that order. These arguments are ignored if `flds` is specified. If none of `flds`, `flds1`, `flds2` are specified, the columns with the same names in `df1` and `df2` are compared, if any.
`types`	a vector of characters indicating the comparison type per comparison field. The options are: `"lv"` for comparisons based on the Levenshtein edit distance normalized to `[0,1]`, with `0` indicating no disagreement and `1` indicating maximum disagreement; `"bi"` for binary comparisons (agreement/disagreement); `"nu"` for numeric comparisons computed as the absolute difference. The default is `"lv"`. Fields compared with the `"lv"` option are first transformed to `character` class. Factors with different levels compared using the `"bi"` option are transformed to factors with the union of the levels. Fields compared with the `"nu"` option need to be of class `numeric`.
`breaks`	break points for the comparisons to obtain levels of disagreement. It can be a list of length equal to the number of comparison fields, containing one numeric vector with the break points for each comparison field, where entries corresponding to comparison type `"bi"` are ignored. It can also be a named list of length two with elements `lv` and `nu` containing numeric vectors with the break points for all Levenshtein-based and numeric comparisons, respectively. Finally, it can be a numeric vector with the break points for all comparison fields of type `"lv"` and `"nu"`, which might be meaningful only if all the non-binary comparisons are of a single type, either `"lv"` or `"nu"`. For comparisons based on the normalized Levenshtein distance, a vector of length `L` of break points for the interval `[0,1]` leads to `L+1` levels of disagreement. Similarly, for comparisons based on the absolute difference, the break points are for the interval `[0,Inf)`. The default is `breaks=c(0,.25,.5)`, which might be meaningful only for comparisons of type `"lv"`.
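
The way `types = "lv"` and `breaks` interact can be illustrated with a small base-R sketch. This is not the package's implementation: the normalization (dividing the edit distance by the length of the longer string) and the right-closed intervals produced by `cut()` are assumptions made for illustration, and the vectors `x` and `y` are made-up data.

```r
# Illustrative sketch only; compareRecords() may normalize and bin differently.
x <- c("maria", "john")          # made-up values from a field of the first file
y <- c("mario", "john", "jon")   # made-up values from the same field of the second file

# Levenshtein edit distance for all pairs (rows index x, columns index y).
d <- adist(x, y)

# Assumption: normalize by the length of the longer string so values lie in [0,1].
len <- outer(nchar(x), nchar(y), pmax)
nd <- d / len

# Break points c(0, .25, .5) (L = 3) give L + 1 = 4 levels of disagreement.
# Assumption: intervals are right-closed, so level 1 means exact agreement.
breaks <- c(0, 0.25, 0.5)
lev <- matrix(cut(nd, breaks = c(-Inf, breaks, Inf), labels = FALSE),
              nrow = nrow(nd))
lev
```

Under these assumptions, identical strings fall in level 1 and strings with no characters in common fall in level 4.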

Value

a list containing:

comparisons: matrix with n1*n2 rows, where the comparison pattern for record pair (i,j) appears in row (j-1)*n1+i, for i in {1,...,n1} and j in {1,...,n2}. A comparison field with L+1 levels of disagreement is represented by L+1 columns of TRUE/FALSE indicators. Missing comparisons are coded as FALSE, which is justified under an assumption of ignorability of the missing comparisons; see Sadinle (2017).
n1,n2: the datafile sizes, n1 = nrow(df1) and n2 = nrow(df2).
nDisagLevs: a vector containing the number of levels of disagreement per comparison field.
compFields: a data frame containing the names of the fields in the datafiles used in the comparisons and the types of comparison.
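
The row layout and indicator coding described above can be sketched in base R. The helper names `pair_row`, `row_pair`, and `levels_to_indicators` are hypothetical and are not part of BRL; they only restate the indexing rule and the TRUE/FALSE coding given in this section.

```r
# Hypothetical helpers for illustration; not exported by BRL.

# Row of the comparisons matrix holding record pair (i, j).
pair_row <- function(i, j, n1) (j - 1) * n1 + i

# Inverse mapping: recover (i, j) from a row number.
row_pair <- function(r, n1) c(i = (r - 1) %% n1 + 1, j = (r - 1) %/% n1 + 1)

# Turn a vector of disagreement levels into TRUE/FALSE indicator columns,
# coding missing comparisons (NA) as FALSE in every column.
levels_to_indicators <- function(lev, n_levels) {
  ind <- outer(lev, seq_len(n_levels), "==")
  ind[is.na(ind)] <- FALSE
  ind
}

n1 <- 4                      # made-up size of the first datafile
pair_row(2, 3, n1)           # pair (2,3) sits in row 10
row_pair(10, n1)             # row 10 maps back to i = 2, j = 3
levels_to_indicators(c(1, 3, NA), n_levels = 4)
```

The indexing rule is the same column-major order R uses when an n1-by-n2 matrix is flattened with `as.vector()`.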

References

Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association 112(518), 600-612.

Examples

data(twoFiles)

myCompData <- compareRecords(df1, df2, 
                             flds=c("gname", "fname", "age", "occup"),
                             types=c("lv","lv","bi","bi"), 
                             breaks=c(0,.25,.5))

## same as 
myCompData <- compareRecords(df1, df2, types=c("lv","lv","bi","bi"))


## let's transform 'occup' to numeric to illustrate how to obtain numeric comparisons 
df1$occup <- as.numeric(df1$occup)
df2$occup <- as.numeric(df2$occup)

## using different break points for 'lv' and 'nu' comparisons 
myCompData1 <- compareRecords(df1, df2, 
                              flds=c("gname", "fname", "age", "occup"),
                              types=c("lv","lv","bi","nu"), 
                              breaks=list(lv=c(0,.25,.5), nu=0:3))

## using different break points for each comparison field
myCompData2 <- compareRecords(df1, df2, 
                              flds=c("gname", "fname", "age", "occup"),
                              types=c("lv","lv","bi","nu"), 
                              breaks=list(c(0,.25,.5), c(0,.2,.4,.6), NULL, 0:3))

[Package BRL version 0.1.0 Index]