emWeights {RecordLinkage} | R Documentation |
Calculate weights
Description
Calculates weights for Record Linkage based on an EM algorithm.
Usage
emWeights(rpairs, cutoff = 0.95, ...)
## S4 method for signature 'RecLinkData'
emWeights(rpairs, cutoff = 0.95, ...)
## S4 method for signature 'RLBigData'
emWeights(rpairs, cutoff = 0.95,
verbose = TRUE, ...)
Arguments
rpairs |
The record pairs for which to compute weights. See details. |
cutoff |
Either a numeric value in the range [0,1] or a vector with the same length as the number of attributes in the data. Cutoff value for string comparator. |
verbose |
Logical. Whether to print progress messages. |
... |
Additional arguments passed to |
Details
Since package version 0.3, this is a generic function with methods for S3 objects of class RecLinkData as well as S4 objects of classes "RLBigDataDedup" and "RLBigDataLinkage".
The weight of a record pair is calculated as \log_{2}\frac{M}{U}, where M and U are the estimated m- and u-probabilities for the comparison pattern at hand. If a string comparator is used, weights are first calculated based on a binary table where all comparison values greater than or equal to cutoff are set to one and all others to zero. The resulting weight is then adjusted by adding, for every pair, \log_{2}\left(\prod_{j:s^{i}_{j}\geq \textit{cutoff}}s^{i}_{j}\right), where s^{i}_{j} is the value of the string metric for attribute j in data pair i.
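The formula above can be traced by hand for a single record pair. The following is a minimal sketch in base R; the values of M, U and the string-comparator results in s are made up for illustration and do not come from the package.

```r
# Assumed m- and u-probabilities for one comparison pattern (illustrative values)
M <- 0.9    # estimated P(pattern | match)
U <- 0.05   # estimated P(pattern | non-match)
base_weight <- log2(M / U)

# Assumed string-comparator values for the attributes of one record pair
s <- c(fname = 0.97, lname = 1.00, city = 0.60)
cutoff <- 0.95

# Only values at or above the cutoff enter the product;
# an empty product is 1, so the adjustment is then 0
adjustment <- log2(prod(s[s >= cutoff]))

weight <- base_weight + adjustment
print(c(base = base_weight, adjustment = adjustment, weight = weight))
```

Because every s value is at most 1, the adjustment is never positive: agreement that is close but not exact lowers the weight slightly compared with exact agreement.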
The appropriate value of cutoff depends on the choice of string comparator. The default is adjusted to jarowinkler; a lower value (e.g. 0.7) is recommended for levenshteinSim.
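As an illustration of choosing cutoff to match the string comparator, the sketch below pairs levenshteinSim with the lower value suggested above. It assumes the RecordLinkage package and its bundled RLdata500 data set are available; the blocking fields are an arbitrary choice for the example.

```r
# Runs only if the RecordLinkage package is installed
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  library(RecordLinkage)
  data(RLdata500)

  # Build record pairs using Levenshtein similarity as string comparator
  rpairs_lev <- compare.dedup(RLdata500, identity = identity.RLdata500,
                              blockfld = list(1, 3, 5, 6, 7),
                              strcmp = TRUE, strcmpfun = levenshteinSim)

  # Lower cutoff than the default 0.95, as recommended for levenshteinSim
  rpairs_lev <- emWeights(rpairs_lev, cutoff = 0.7)
}
```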
Estimation of M and U is done by an EM algorithm, implemented by mygllm. For every comparison pattern, the estimated numbers of matches and non-matches are used to compute the corresponding probabilities. Estimates based on the average frequencies of values and given error rates are taken as initial values. In our experience, this increases stability and performance of the EM algorithm.
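To make the idea of EM-based estimation concrete, here is a minimal sketch in base R. It is not the mygllm routine used by the package: it assumes conditional independence between attributes, works on simulated binary comparison patterns, and uses fixed starting values chosen for the example rather than the frequency-based initial values described above.

```r
set.seed(1)

# Simulated binary comparison patterns for 3 attributes (illustrative data)
n <- 2000
is_match <- runif(n) < 0.1
m_true <- c(0.95, 0.90, 0.85)   # assumed agreement probabilities among matches
u_true <- c(0.10, 0.05, 0.20)   # assumed agreement probabilities among non-matches
gamma <- sapply(1:3, function(j) {
  as.integer(runif(n) < ifelse(is_match, m_true[j], u_true[j]))
})

# P(pattern | class) for every pair, assuming independent attributes
pattern_prob <- function(gamma, prob) {
  as.vector(exp(gamma %*% log(prob) + (1 - gamma) %*% log(1 - prob)))
}

# Starting values (assumed for this sketch)
p <- 0.05            # proportion of matches
m <- rep(0.9, 3)
u <- rep(0.1, 3)

for (iter in 1:100) {
  # E-step: posterior probability that each pair is a match
  pm <- p * pattern_prob(gamma, m)
  pu <- (1 - p) * pattern_prob(gamma, u)
  g <- pm / (pm + pu)

  # M-step: re-estimate parameters from the expected class memberships
  p <- mean(g)
  m <- colSums(g * gamma) / sum(g)
  u <- colSums((1 - g) * gamma) / sum(1 - g)
}

# Weight of each pair: log2 of the ratio of estimated M and U for its pattern
weights <- log2(pattern_prob(gamma, m) / pattern_prob(gamma, u))
```

In this sketch the per-pair quantities pattern_prob(gamma, m) and pattern_prob(gamma, u) play the role of M and U; pairs that share a comparison pattern receive the same weight.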
Some progress messages are printed to the message stream (see message) if verbose == TRUE. This includes progress bars, but these are suppressed if output is diverted by sink to avoid cluttering the output file.
Value
A copy of rpairs with the weights attached. See the class documentation (RecLinkData, "RLBigDataDedup" and "RLBigDataLinkage") on how weights are stored.
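A short usage sketch for the RecLinkData method follows. It assumes the RecordLinkage package and its bundled RLdata500 data set are available, and that weights for this class are stored in the Wdata component, as described in the RecLinkData class documentation.

```r
# Runs only if the RecordLinkage package is installed
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  library(RecordLinkage)
  data(RLdata500)

  # Generate record pairs for deduplication with several blocking passes
  rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                          blockfld = list(1, 3, 5, 6, 7))

  # Attach EM-based weights; the result is a copy of rpairs
  rpairs <- emWeights(rpairs)

  # One weight per record pair
  print(summary(rpairs$Wdata))
}
```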
Side effects
The "RLBigData" method writes the calculated weights to a disk file that holds an ffvector belonging to the object.
Author(s)
Andreas Borg, Murat Sariyar
References
William E. Winkler: Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage, in: Proceedings of the Section on Survey Research Methods, American Statistical Association 1988, pp. 667–671.
See Also
emClassify for classification of weighted pairs.
epiWeights for a different approach to weight calculation.