| optimalThreshold {RecordLinkage} | R Documentation |
Optimal Threshold for Record Linkage
Description
Calculates the optimal threshold for weight-based Record Linkage.
Usage
optimalThreshold(rpairs, my = NaN, ny = NaN)
## S4 method for signature 'RecLinkData'
optimalThreshold(rpairs, my = NaN, ny = NaN)
## S4 method for signature 'RLBigData'
optimalThreshold(rpairs, my = NaN, ny = NaN)
Arguments
| rpairs | Record pairs for which to calculate a threshold, given as a RecLinkData or RLBigData object. |
| my | A real value in the range [0,1]. Error bound for false positives. |
| ny | A real value in the range [0,1]. Error bound for false negatives. |
Details
Weights must have been calculated for rpairs, for example by
emWeights or epiWeights.
The true match status must be known for rpairs. It is usually provided
through the identity argument of the compare.* functions.
In the following, it is assumed that all record pairs with weights greater than
or equal to the threshold are classified as links and the remaining pairs as
non-links.
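The classification rule described above can be sketched in base R. This is only an illustration: the weights, the match status and the threshold value are made-up data, not output of the package.

```r
# Hypothetical weights and known match status for six record pairs
weights  <- c(0.91, 0.78, 0.64, 0.52, 0.33, 0.10)
is_match <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Arbitrary threshold chosen for illustration
threshold <- 0.6

# Pairs with weight >= threshold are links, all others are non-links
predicted_link <- weights >= threshold

# Cross-tabulate true status against the classification
print(table(truth = is_match, predicted = predicted_link))

# Number of misclassified pairs (false links plus missed links)
n_errors <- sum(predicted_link != is_match)
print(n_errors)
```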
If neither my nor ny is given, the threshold that minimizes the absolute
number of misclassified record pairs is returned. If my is supplied (ny is
ignored in this case), the threshold is chosen to maximize the number of
correctly classified links while keeping the ratio of false links to the
total number of links less than or equal to my.
If only ny is supplied, the number of correct non-links is maximized under the
condition that the ratio of falsely classified non-links to the total number of
non-links does not exceed ny.
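The default criterion (fewest misclassified pairs) can be illustrated with a small base R sketch. It is not the package's implementation: the data are made up, the helper count_errors is hypothetical, and restricting the candidate thresholds to the observed weights is an assumption made for simplicity.

```r
# Hypothetical weights and known match status for six record pairs
weights  <- c(0.91, 0.78, 0.64, 0.52, 0.33, 0.10)
is_match <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Hypothetical helper: misclassified pairs for a given threshold
count_errors <- function(threshold, weights, is_match) {
  sum((weights >= threshold) != is_match)
}

# Assumption: only observed weights are tried as candidate thresholds
candidates <- sort(unique(weights))
errors <- vapply(candidates, count_errors, integer(1),
                 weights = weights, is_match = is_match)

# Pick the candidate with the fewest errors (first one in case of ties)
best <- candidates[which.min(errors)]
print(data.frame(threshold = candidates, errors = errors))
print(best)
```

In this toy data two candidates tie for the minimum error count; which.min simply returns the first, which may differ from how the package resolves ties.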
Two separate runs of optimalThreshold, one with a value for my and one with a
value for ny, yield the two thresholds needed for a three-way classification
approach (links, non-links and possible links). The run with my typically
gives the upper threshold and the run with ny the lower one.
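A three-way classification with two thresholds might look like the following base R sketch. The weights and both threshold values are made-up numbers chosen for illustration. In practice the thresholds would come from two calls to optimalThreshold.

```r
# Hypothetical weights for six record pairs
weights <- c(0.91, 0.78, 0.64, 0.52, 0.33, 0.10)

# Hypothetical upper and lower thresholds
upper <- 0.75
lower <- 0.40

# Weights at or above the upper threshold are links, weights below the
# lower threshold are non-links, everything in between is a possible link
classification <- ifelse(weights >= upper, "link",
                  ifelse(weights >= lower, "possible link", "non-link"))
print(table(classification))
```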
Value
A numeric value, the calculated threshold.
Author(s)
Andreas Borg, Murat Sariyar
See Also
emWeights
emClassify
epiWeights
epiClassify
Examples
# load the package and create record pairs
library(RecordLinkage)
data(RLdata500)
p <- compare.dedup(RLdata500, identity = identity.RLdata500, strcmp = TRUE,
                   strcmpfun = levenshteinSim)
# calculate weights
p <- epiWeights(p)
# split record pairs into a training and a validation set
l <- splitData(dataset = p, prop = 0.5, keep.mprop = TRUE)
# get threshold from training set
threshold <- optimalThreshold(l$train)
# classify remaining data
summary(epiClassify(l$valid, threshold))