em_winkler {ludic} | R Documentation |
Implementation of Winkler's EM algorithm for Fellegi-Sunter matching method
Description
em_winkler_big
implements the same method for data too large for the full agreement matrix
to be computed and stored. Agreement is instead recomputed on the fly each time it is needed, and the EM steps
are carried out entirely in C++. This reduces RAM usage (which remains substantial), at the cost of
longer computation time.
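The trade-off described above can be illustrated with a small sketch. It assumes that "agreement" between two records on a variable simply means equal values, which may differ from the definition used inside the package; agree_pair is a hypothetical helper, not a ludic function. The full agreement matrix has one row per pair (n1*n2 rows), whereas the on-the-fly approach only ever holds one agreement vector at a time.

```r
# Toy binary data: n1 = 3 and n2 = 2 records described by K = 4 variables.
data1 <- matrix(c(1, 0, 1, 0,
                  0, 0, 1, 1,
                  1, 1, 0, 0), nrow = 3, byrow = TRUE)
data2 <- matrix(c(1, 0, 1, 0,
                  0, 1, 1, 1), nrow = 2, byrow = TRUE)

# Hypothetical helper (assumption: agreement = equality of the two values).
# Computes the agreement vector of a single pair on demand.
agree_pair <- function(i, j) as.integer(data1[i, ] == data2[j, ])

# Full agreement matrix: one row per pair, so n1 * n2 rows and K columns.
# This is the object that becomes too large to store for big data sets.
pairs <- expand.grid(i = seq_len(nrow(data1)), j = seq_len(nrow(data2)))
agreement <- t(mapply(agree_pair, pairs$i, pairs$j))

dim(agreement)    # number of pairs by number of variables
agree_pair(1, 1)  # agreement vector for the first record of each data set
```

Storing the matrix costs memory proportional to n1*n2*K, while recomputing agree_pair costs time at every EM iteration.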
Usage
em_winkler(
data1,
data2,
tol = 0.001,
maxit = 500,
do_plot = TRUE,
oneone = FALSE,
verbose = FALSE
)
em_winkler_big(
data1,
data2,
tol = 0.001,
maxit = 500,
do_plot = TRUE,
oneone = FALSE,
verbose = FALSE
)
Arguments
data1 |
either a binary ( |
data2 |
either a binary ( |
tol |
tolerance for the EM algorithm convergence. |
maxit |
maximum number of iterations for the EM algorithm. |
do_plot |
a logical flag indicating whether a plot should be drawn for the EM convergence.
Default is TRUE. |
oneone |
a logical flag indicating whether 1-1 matching should be enforced.
If |
verbose |
a logical flag indicating whether intermediate values from the EM algorithm should
be printed. Useful for debugging. Default is FALSE. |
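To make the roles of tol and maxit concrete, here is a minimal sketch of an EM algorithm for the Fellegi-Sunter model, written as a two-class mixture of independent Bernoulli agreement indicators. It is not the package's implementation: em_sketch, the starting values, the clamping constant and the stopping rule (largest absolute parameter change below tol) are all assumptions made for illustration.

```r
# Hypothetical helper, not part of ludic.
# agreement: binary matrix with one row per pair and one column per variable.
em_sketch <- function(agreement, tol = 0.001, maxit = 500) {
  K <- ncol(agreement)
  eps <- 1e-6              # assumed bound keeping probabilities off 0 and 1
  p <- 0.1                 # assumed starting proportion of true matches
  m <- rep(0.9, K)         # assumed start for P(agreement | match)
  u <- rep(0.2, K)         # assumed start for P(agreement | non-match)
  converged <- FALSE
  for (it in seq_len(maxit)) {
    # E-step: posterior probability that each pair is a match,
    # computed from log-likelihoods for numerical stability.
    log_m <- agreement %*% log(m) + (1 - agreement) %*% log(1 - m)
    log_u <- agreement %*% log(u) + (1 - agreement) %*% log(1 - u)
    w <- as.vector(1 / (1 + (1 - p) / p * exp(log_u - log_m)))
    # M-step: weighted proportions, clamped to [eps, 1 - eps].
    p_new <- min(max(mean(w), eps), 1 - eps)
    m_new <- pmin(pmax(colSums(w * agreement) / sum(w), eps), 1 - eps)
    u_new <- pmin(pmax(colSums((1 - w) * agreement) / sum(1 - w), eps), 1 - eps)
    # Assumed stopping rule: largest absolute parameter change below tol.
    delta <- max(abs(c(p_new - p, m_new - m, u_new - u)))
    p <- p_new
    m <- m_new
    u <- u_new
    if (delta < tol) {
      converged <- TRUE
      break
    }
  }
  list(p = p, m = m, u = u, iterations = it, converged = converged)
}

# Simulated pairs: about 10% true matches that agree far more often.
set.seed(1)
N <- 2000
K <- 6
is_match <- rbinom(N, 1, 0.1) == 1
agreement <- matrix(rbinom(N * K, 1, ifelse(rep(is_match, K), 0.95, 0.2)),
                    nrow = N)
fit <- em_sketch(agreement, tol = 0.001, maxit = 500)
fit$converged
fit$p
```

A smaller tol demands more iterations before the loop stops, and maxit caps the loop when the tolerance is never reached, in which case converged stays FALSE.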
Value
a list containing:
matchingScore
a matrix of size n1 x n2 with the matching score for each of the n1*n2 pairs.
threshold_ms
threshold value for the matching scores above which pairs are considered true matches.
estim_nbmatch
an estimation of the number of true matches (N pairs considered multiplied by p, the estimated proportion of true matches from the EM algorithm).
convergence_status
a logical flag indicating whether the EM algorithm converged.
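As an illustration of how the returned elements fit together, the sketch below builds a mock list with the documented element names and extracts the pairs whose score exceeds the threshold. The numbers and identifiers are invented, and treating "above" as a strict inequality is an assumption.

```r
# Mock result using the documented element names; all values are invented.
res <- list(
  matchingScore = matrix(c( 8.2, -3.1, -4.0,
                           -2.7,  7.5, -3.3),
                         nrow = 2, byrow = TRUE,
                         dimnames = list(c("A1", "A2"), c("B1", "B2", "B3"))),
  threshold_ms = 5,
  estim_nbmatch = 2,
  convergence_status = TRUE
)

# Pairs scoring above the threshold are the declared matches
# (assumption: strict inequality).
idx <- which(res$matchingScore > res$threshold_ms, arr.ind = TRUE)
matches <- data.frame(
  id1 = rownames(res$matchingScore)[idx[, "row"]],
  id2 = colnames(res$matchingScore)[idx[, "col"]],
  score = res$matchingScore[idx],
  stringsAsFactors = FALSE
)
matches
```

It is worth checking convergence_status before relying on the scores, and comparing the number of rows in matches with estim_nbmatch as a sanity check.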
References
Winkler WE. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. Proc Sect Surv Res Methods, Am Stat Assoc 1988: 667-71.
Grannis SJ, Overhage JM, Hui S, et al. Analysis of a probabilistic record linkage technique without human review. AMIA 2003 Symp Proc 2003: 259-63.
Examples
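# Simulate two data sets of 100 records and 10 variables each;
# the first 10 rows of mat2 are copies of the first 10 rows of mat1
mat1 <- matrix(round(rnorm(n=1000, sd=1.2)), ncol=10, nrow=100)
mat2 <- rbind(mat1[1:10, ],
              matrix(round(rnorm(n=900, sd=1.2)), ncol=10, nrow=90)
)
# Label the records of each data set
rownames(mat1) <- paste0("A", 1:nrow(mat1))
rownames(mat2) <- paste0("B", 1:nrow(mat2))
# Binarize both data sets
mat1 <- 1*(mat1>1)
mat2 <- 1*(mat2>1)
em_winkler(mat1, mat2)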