emClassify {RecordLinkage} | R Documentation |
Weight-based Classification of Data Pairs
Description
Classifies data pairs to which weights were assigned by emWeights, based on user-defined thresholds or predefined error rates.
Usage
emClassify(rpairs, threshold.upper = Inf,
threshold.lower = threshold.upper, my = Inf, ny = Inf, ...)
## S4 method for signature 'RecLinkData,ANY,ANY'
emClassify(rpairs, threshold.upper = Inf,
threshold.lower = threshold.upper, my = Inf, ny = Inf)
## S4 method for signature 'RLBigData,ANY,ANY'
emClassify(rpairs, threshold.upper = Inf,
threshold.lower = threshold.upper, my = Inf, ny = Inf,
withProgressBar = (sink.number()==0))
Arguments
rpairs |
A "RecLinkData" or "RLBigData" object holding data pairs with weights assigned by emWeights. |
my |
A probability. Error bound for false positives. |
ny |
A probability. Error bound for false negatives. |
threshold.upper |
A numeric value. Threshold for links. |
threshold.lower |
A numeric value. Threshold for possible links. |
withProgressBar |
A logical value. Whether to display a progress bar. |
... |
Placeholder for method-specific arguments. |
Details
Two general approaches are implemented. The classical procedure
by Fellegi and Sunter (see references) minimizes the number of
possible links with given error levels for false links (my) and
false non-links (ny).
The second approach requires thresholds for links and possible links to be set
by the user. A pair with weight w is classified as a link if
w >= threshold.upper, as a possible link if
threshold.upper > w >= threshold.lower, and as a non-link if
w < threshold.lower.
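The threshold rule can be illustrated with a minimal base R sketch. The helper classify_by_threshold is hypothetical and not part of RecordLinkage; the labels "L", "P" and "N" (link, possible link, non-link) and the treatment of weights exactly equal to a threshold are assumptions made for illustration.

```r
# Hypothetical helper, not part of RecordLinkage: maps a numeric vector of
# weights to "L" (link), "P" (possible link) or "N" (non-link).
classify_by_threshold <- function(w, threshold.upper = Inf,
                                  threshold.lower = threshold.upper) {
  stopifnot(is.numeric(w), threshold.lower <= threshold.upper)
  pred <- rep("N", length(w))          # default: non-link
  pred[w >= threshold.lower] <- "P"    # at or above the lower threshold
  pred[w >= threshold.upper] <- "L"    # at or above the upper threshold
  factor(pred, levels = c("N", "P", "L"))
}

# Made-up weights, for illustration only
weights <- c(-12.5, -3, 0, 8.2, 24, 31)
pred <- classify_by_threshold(weights, threshold.upper = 24,
                              threshold.lower = -7)
table(pred)
```

When threshold.lower is left at its default, it equals threshold.upper, so no pair is labelled as a possible link.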
If threshold.upper or threshold.lower is given, the
threshold-based approach is used; otherwise, if one of the error bounds is
given, the Fellegi-Sunter model is used. If only my is supplied, links are
chosen to meet the error bound and all other pairs are classified as non-links
(the equivalent case holds if only ny is specified). If no arguments
other than rpairs are given, a single threshold of 0 is used.
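Both approaches can be tried on the RLdata500 example data shipped with the package. This is a sketch, not a recommended workflow: the blocking fields and the threshold values 24 and -7 are assumptions suited to this data set, and the error bounds of 0.05 are arbitrary.

```r
library(RecordLinkage)

# Build comparison patterns for the bundled example data and assign weights
data(RLdata500)
rpairs <- compare.dedup(RLdata500, blockfld = list(1, 5:7),
                        identity = identity.RLdata500)
rpairs <- emWeights(rpairs)

# Threshold-based approach: the thresholds are supplied directly
res_thr <- emClassify(rpairs, threshold.upper = 24, threshold.lower = -7)

# Fellegi-Sunter approach: only the error bounds are supplied
res_fs <- emClassify(rpairs, my = 0.05, ny = 0.05)

# Compare how many pairs each approach assigns to each class
table(res_thr$prediction)
table(res_fs$prediction)
```

Because the true match status is known for RLdata500, summary(res_thr) can be used to inspect the resulting error rates.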
Value
For the "RecLinkData" method, an S3 object of class "RecLinkResult"
that represents a copy of rpairs with the additional element
prediction, which stores the classification result.
For the "RLBigData" method, an S4 object of class "RLResult".
Note
The classification quality of the Fellegi-Sunter method depends strongly on reasonable estimates of the m- and u-probabilities. The results should be evaluated critically.
Author(s)
Andreas Borg, Murat Sariyar
References
Ivan P. Fellegi, Alan B. Sunter: A Theory for Record Linkage, in: Journal of the American Statistical Association Vol. 64, No. 328 (Dec., 1969), pp. 1183–1210.
See Also
getPairs to produce output from which thresholds can
be determined conveniently.