optimalThreshold {RecordLinkage}    R Documentation
Optimal Threshold for Record Linkage
Description
Calculates the optimal threshold for weight-based Record Linkage.
Usage
optimalThreshold(rpairs, my = NaN, ny = NaN)
## S4 method for signature 'RecLinkData'
optimalThreshold(rpairs, my = NaN, ny = NaN)
## S4 method for signature 'RLBigData'
optimalThreshold(rpairs, my = NaN, ny = NaN)
Arguments
rpairs
    Record pairs for which to calculate a threshold (a RecLinkData or
    RLBigData object).
my
    A real value in the range [0,1]. Error bound for false positives.
ny
    A real value in the range [0,1]. Error bound for false negatives.
Details
Weights must have been calculated for rpairs, for example by emWeights or
epiWeights.
The true match status must be known for rpairs; it is usually provided
through the identity argument of the compare.* functions.
For the following, it is assumed that all records with weights greater than or
equal to the threshold are classified as links, the remaining as non-links.
If no further arguments are given, the threshold that minimizes the
absolute number of misclassified record pairs is returned. If my is
supplied (ny is ignored in this case), the threshold is chosen to maximize
the number of correctly classified links while keeping the ratio of false
links to the total number of links less than or equal to my.
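The two rules above can be sketched in base R on toy data. This is an illustration of the description only, not the package's implementation: the weights, match flags, and the names candidates, counts, best_default and best_my are hypothetical, and "total number of links" is read here as the number of pairs classified as links.

```r
# Toy data (hypothetical): one weight and one known match status per pair
weights  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)
is_match <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)

# Candidate thresholds: every observed weight, highest first
candidates <- sort(unique(weights), decreasing = TRUE)

# Outcome counts when pairs with weight >= threshold are classified as links
counts <- t(sapply(candidates, function(thr) {
  link <- weights >= thr
  c(tp = sum(link & is_match),    # correctly classified links
    fp = sum(link & !is_match),   # false links
    fn = sum(!link & is_match))   # missed matches
}))

# Default rule: minimize the absolute number of misclassified pairs
errors <- counts[, "fp"] + counts[, "fn"]
best_default <- candidates[which.min(errors)]

# Rule with my: maximize correct links subject to
# (false links) / (pairs classified as links) <= my
my <- 0.1
fp_ratio <- counts[, "fp"] / (counts[, "tp"] + counts[, "fp"])
ok <- fp_ratio <= my
# Among admissible thresholds with the most correct links, take the highest
best_my <- max(candidates[ok & counts[, "tp"] == max(counts[ok, "tp"])])

print(c(default = best_default, my = best_my))
```

On this toy data the bound on false links pushes the threshold above the one that minimizes the total error count, trading missed matches for fewer false links.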
If ny is supplied, the number of correct non-links is maximized under the
condition that the ratio of falsely classified non-links to the total number
of non-links does not exceed ny.
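A minimal base R sketch of the ny rule on toy data follows. It is an interpretation of the description, not the package's code: the data and the names candidates, counts and best_ny are hypothetical, "total number of non-links" is read as the number of pairs classified as non-links, and a threshold that produces no non-links at all is assumed to have ratio 0.

```r
# Toy data (hypothetical): one weight and one known match status per pair
weights  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)
is_match <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)

# Candidate thresholds: every observed weight, highest first
candidates <- sort(unique(weights), decreasing = TRUE)

# Outcome counts when pairs with weight < threshold are classified as non-links
counts <- t(sapply(candidates, function(thr) {
  non_link <- weights < thr
  c(tn = sum(non_link & !is_match),  # correct non-links
    fn = sum(non_link & is_match))   # falsely classified non-links
}))

ny <- 0.1
n_non_links <- counts[, "tn"] + counts[, "fn"]
# Assumption: an empty non-link set counts as ratio 0 (avoids 0/0)
fn_ratio <- ifelse(n_non_links > 0, counts[, "fn"] / n_non_links, 0)
ok <- fn_ratio <= ny

# Among admissible thresholds with the most correct non-links, take the lowest
best_ny <- min(candidates[ok & counts[, "tn"] == max(counts[ok, "tn"])])
print(best_ny)
```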
Two separate runs of optimalThreshold with values for my and ny
respectively allow for obtaining a lower and an upper threshold
for a three-way classification approach (yielding links, non-links and
possible links).
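To illustrate how two thresholds yield three classes, here is a small base R sketch. The weights and both threshold values are made up for the example; in practice upper would come from a run with my and lower from a run with ny.

```r
# Toy weights (hypothetical)
weights <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)

# Hypothetical thresholds standing in for two optimalThreshold() results
upper <- 0.7  # e.g. from a run with my supplied
lower <- 0.4  # e.g. from a run with ny supplied

# Weights >= upper are links, weights below lower are non-links,
# everything in between is a possible link
prediction <- ifelse(weights >= upper, "link",
              ifelse(weights >= lower, "possible link", "non-link"))
print(table(prediction))
```

The classification functions listed under See Also appear to accept an upper and a lower threshold for this purpose (threshold.upper and threshold.lower); their help pages should be consulted for the exact arguments.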
Value
A numeric value, the calculated threshold.
Author(s)
Andreas Borg, Murat Sariyar
See Also
emWeights
emClassify
epiWeights
epiClassify
Examples
library(RecordLinkage)
# create record pairs
data(RLdata500)
p <- compare.dedup(RLdata500, identity = identity.RLdata500, strcmp = TRUE,
  strcmpfun = levenshteinSim)
# calculate weights
p <- epiWeights(p)
# split record pairs into two sets
l <- splitData(dataset = p, prop = 0.5, keep.mprop = TRUE)
# get threshold from training set
threshold <- optimalThreshold(l$train)
# classify remaining data
summary(epiClassify(l$valid, threshold))