em_winkler {ludic} R Documentation

Implementation of Winkler's EM algorithm for Fellegi-Sunter matching method

Description

em_winkler implements Winkler's EM algorithm for the Fellegi-Sunter matching method. em_winkler_big implements the same method for data too large for the full agreement matrix to be computed and stored. Agreement is instead recomputed on the fly each time it is needed, and the EM steps run entirely in C++. This lowers RAM usage (which remains substantial) at the cost of longer computation time.
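The trade-off between storing agreement and recomputing it can be illustrated with a minimal sketch in plain R. It assumes that "agreement" on a variable means both records hold the same value; agree_one is a hypothetical helper written for this illustration and is not part of ludic, whose internal C++ code may differ.

```r
# Hypothetical helper (not part of ludic): agreement of one record x
# against every row of data2, assuming agreement means equal values.
agree_one <- function(x, data2) {
  # sweep() compares each column of data2 with the matching entry of x
  1 * sweep(data2, 2, x, FUN = "==")
}

set.seed(1)
d1 <- matrix(rbinom(30, 1, 0.3), nrow = 3)  # n1 = 3 records, K = 10 variables
d2 <- matrix(rbinom(50, 1, 0.3), nrow = 5)  # n2 = 5 records, K = 10 variables

# Stored variant: the full (n1 * n2) x K agreement matrix is held in memory
full <- do.call(rbind, lapply(seq_len(nrow(d1)),
                              function(i) agree_one(d1[i, ], d2)))

# On-the-fly variant: only one n2 x K block exists at a time, and it has
# to be rebuilt on every pass that needs it
counts <- numeric(ncol(d1))
for (i in seq_len(nrow(d1))) {
  counts <- counts + colSums(agree_one(d1[i, ], d2))
}
```

Both variants yield the same per-variable agreement counts. The second never allocates more than n2 x K values at once, but it repeats the comparison work at every iteration that needs it.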

Usage

em_winkler(
  data1,
  data2,
  tol = 0.001,
  maxit = 500,
  do_plot = TRUE,
  oneone = FALSE,
  verbose = FALSE
)

em_winkler_big(
  data1,
  data2,
  tol = 0.001,
  maxit = 500,
  do_plot = TRUE,
  oneone = FALSE,
  verbose = FALSE
)

Arguments

data1

either a binary (1 or 0 values only) matrix or a binary data frame of dimension n1 x K whose rownames are the observation identifiers.

data2

either a binary (1 or 0 values only) matrix or a binary data frame of dimension n2 x K whose rownames are the observation identifiers.

tol

tolerance for the EM algorithm convergence.

maxit

maximum number of iterations for the EM algorithm.

do_plot

a logical flag indicating whether a plot should be drawn for the EM convergence. Default is TRUE.

oneone

a logical flag indicating whether 1-1 matching should be enforced. If TRUE, only the maximum score in each column of the returned matchingScores is kept, and all lower scores are replaced by threshold-1. Default is FALSE, in which case the original matchingScores are returned.

verbose

a logical flag indicating whether intermediate values from the EM algorithm should be printed. Useful for debugging. Default is FALSE.
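The effect described for oneone = TRUE can be sketched as follows. This is an illustration only: enforce_oneone and its threshold argument are hypothetical names, not part of ludic, and how the package obtains the threshold is not stated here.

```r
# Hypothetical helper (not part of ludic): keep the highest score in each
# column and replace every other entry by threshold - 1.
enforce_oneone <- function(scores, threshold) {
  out <- matrix(threshold - 1, nrow = nrow(scores), ncol = ncol(scores),
                dimnames = dimnames(scores))
  best <- apply(scores, 2, which.max)       # row of the top score per column
  idx <- cbind(best, seq_len(ncol(scores))) # (row, column) index pairs
  out[idx] <- scores[idx]
  out
}

# Toy score matrix: rows are records of data1, columns are records of data2
scores <- matrix(c(2.5, -1.5,  0.3,
                   0.1,  4.0, -2.0),
                 nrow = 3,
                 dimnames = list(paste0("A", 1:3), paste0("B", 1:2)))

res <- enforce_oneone(scores, threshold = 0)
```

In res, only the pairs A1-B1 (2.5) and A2-B2 (4) keep their score, and the four remaining entries equal -1. Ties are resolved by which.max, which picks the first maximum; that choice is an assumption of this sketch.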

Value

a list containing:

References

Winkler WE. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. Proc Sect Surv Res Methods, Am Stat Assoc 1988: 667-71.

Grannis SJ, Overhage JM, Hui S, et al. Analysis of a probabilistic record linkage technique without human review. AMIA 2003 Symp Proc 2003: 259-63.

Examples

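# Simulate two data sets of 10 variables; the first 10 rows of mat2
# duplicate rows 1 to 10 of mat1 and are the true matches
mat1 <- matrix(round(rnorm(n=1000, sd=1.2)), ncol=10, nrow=100)
mat2 <- rbind(mat1[1:10, ],
             matrix(round(rnorm(n=900, sd=1.2)), ncol=10, nrow=90)
             )
# Row names serve as observation identifiers
rownames(mat1) <- paste0("A", 1:nrow(mat1))
rownames(mat2) <- paste0("B", 1:nrow(mat2))
# Binarize: 1 if the rounded value exceeds 1, 0 otherwise
mat1 <- 1*(mat1>1)
mat2 <- 1*(mat2>1)
em_winkler(mat1, mat2)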
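For readers who want to see how tol and maxit typically govern an EM fit of the Fellegi-Sunter model, here is a minimal sketch in plain R. It is not ludic's implementation. The function em_fs_sketch is hypothetical, takes a binary agreement matrix directly (one row per record pair), assumes conditional independence between variables, and stops when the largest parameter change falls below tol, which is an assumed convergence criterion.

```r
# Hypothetical sketch (not ludic's code): EM for a two-class mixture
# (matches vs non-matches) on a binary agreement matrix gamma (pairs x K).
em_fs_sketch <- function(gamma, tol = 0.001, maxit = 500) {
  K <- ncol(gamma)
  eps <- 1e-6                          # keeps probabilities away from 0 and 1
  clamp <- function(x) pmin(pmax(x, eps), 1 - eps)
  p <- 0.05                            # arbitrary starting proportion of matches
  m <- rep(0.8, K)                     # P(agree | match), starting value
  u <- rep(0.2, K)                     # P(agree | non-match), starting value
  for (it in seq_len(maxit)) {
    # E-step: posterior probability that each pair is a match
    log_m <- gamma %*% log(m) + (1 - gamma) %*% log(1 - m)
    log_u <- gamma %*% log(u) + (1 - gamma) %*% log(1 - u)
    g <- as.vector(1 / (1 + (1 - p) / p * exp(log_u - log_m)))
    # M-step: weighted agreement frequencies
    m_new <- clamp(colSums(g * gamma) / sum(g))
    u_new <- clamp(colSums((1 - g) * gamma) / sum(1 - g))
    p_new <- clamp(mean(g))
    # Largest absolute change across all parameters
    delta <- max(abs(c(p_new - p, m_new - m, u_new - u)))
    p <- p_new; m <- m_new; u <- u_new
    if (delta < tol) break
  }
  list(p = p, m = m, u = u, iterations = it, converged = delta < tol)
}

# Simulated agreement patterns: about 10% of pairs are matches, which agree
# on each variable with probability 0.9 (0.1 for non-matches)
set.seed(42)
N <- 2000; K <- 6
is_match <- rbinom(N, 1, 0.1)
prob <- ifelse(is_match == 1, 0.9, 0.1)
gamma <- matrix(rbinom(N * K, 1, rep(prob, times = K)), nrow = N, ncol = K)

fit <- em_fs_sketch(gamma, tol = 0.001, maxit = 500)
```

A smaller tol demands a tighter fit and usually more iterations, while maxit caps the run when the tolerance is never reached.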


[Package ludic version 0.2.0 Index]