JaroWinkler {comparator} | R Documentation |
Jaro-Winkler String/Sequence Comparator
Description
The Jaro-Winkler comparator is a variant of the Jaro comparator that
boosts the similarity score for strings/sequences with matching prefixes.
It was developed for comparing names at the U.S. Census Bureau.
Usage
JaroWinkler(
p = 0.1,
threshold = 0.7,
max_prefix = 4L,
similarity = TRUE,
ignore_case = FALSE,
use_bytes = FALSE
)
Arguments
p |
a non-negative numeric scalar no larger than 1/max_prefix. The boost applied to eligible similarity scores is scaled by this factor. Defaults to 0.1. |
threshold |
a numeric scalar on the unit interval. Jaro similarities greater than this value are boosted based on matching characters in the prefixes of both strings. Jaro similarities below this value are returned unadjusted. Defaults to 0.7. |
max_prefix |
a non-negative integer scalar, specifying the size of the prefix to consider for boosting. Defaults to 4 (characters). |
similarity |
a logical. If TRUE, similarity scores are returned (default), otherwise distances are returned (see definition under Details). |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
Details
For simplicity, we assume x and y are strings in this section;
however, the comparator is also implemented for more general sequences.
The Jaro-Winkler similarity (computed when similarity = TRUE) is
defined in terms of the Jaro similarity. If the Jaro similarity
\mathrm{sim}_J(x, y) between strings x and y exceeds a
user-specified threshold 0 \leq \tau \leq 1, the similarity score is
boosted in proportion to the number of matching characters in the
prefixes of x and y; otherwise the Jaro similarity is returned
unadjusted. More precisely, when the threshold is exceeded, the
Jaro-Winkler similarity is defined as:
\mathrm{sim}_{JW}(x, y) = \mathrm{sim}_J(x, y) + \min(c(x, y), l) p (1 - \mathrm{sim}_J(x, y)),
where c(x, y) is the length of the common prefix, l \geq 0 is a
user-specified upper bound on the prefix size, and 0 \leq p \leq 1/l
is a scaling factor.
The Jaro-Winkler distance is computed when similarity = FALSE and is
defined as
\mathrm{dist}_{JW}(x, y) = 1 - \mathrm{sim}_{JW}(x, y).
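As an illustration of the two definitions above, the following base R sketch applies the boost to a Jaro similarity that has already been computed. The helper names common_prefix_length and jw_boost are hypothetical and are not part of the package. The sketch also assumes that a Jaro similarity exactly equal to the threshold is left unadjusted, which the text above does not specify. The input 17/18 is the Jaro similarity of "MARTHA" and "MARHTA" (six matching characters, one transposition).

```r
# Hypothetical helper: length of the common prefix c(x, y) of two strings.
common_prefix_length <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  n <- min(length(xs), length(ys))
  if (n == 0L) return(0L)
  mismatch <- which(xs[seq_len(n)] != ys[seq_len(n)])
  if (length(mismatch) == 0L) n else mismatch[1L] - 1L
}

# Hypothetical helper: apply the Jaro-Winkler boost to a given Jaro similarity.
# Assumption: similarities equal to the threshold are returned unadjusted.
jw_boost <- function(sim_j, prefix_len, p = 0.1, threshold = 0.7,
                     max_prefix = 4L) {
  stopifnot(p >= 0, p * max_prefix <= 1)
  if (sim_j <= threshold) return(sim_j)
  sim_j + min(prefix_len, max_prefix) * p * (1 - sim_j)
}

sim_j <- 17 / 18                                   # Jaro similarity, taken as given
c_xy <- common_prefix_length("MARTHA", "MARHTA")   # common prefix "MAR"
sim_jw <- jw_boost(sim_j, c_xy)                    # 17/18 + 3 * 0.1 * (1/18)
dist_jw <- 1 - sim_jw                              # corresponding distance
```

Because p is at most 1/max_prefix, the boost never exceeds 1 - sim_J(x, y), so the boosted similarity stays on the unit interval.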
Value
An instance of the S4 class JaroWinkler is returned. This class inherits from
StringComparator.
Note
Like the Jaro distance, the Jaro-Winkler distance is not a metric as it does not satisfy the identity axiom.
References
Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.
Winkler, W. E. (2006), "Overview of Record Linkage and Current Research Directions", Tech. report. Statistics #2006-2. Statistical Research Division, U.S. Census Bureau.
Winkler, W., McLaughlin, G., Jaro, M. and Lynch, M. (1994), strcmp95.c, Version 2. United States Census Bureau.
See Also
This comparator reduces to the Jaro comparator when max_prefix = 0L,
p = 0 or threshold = 1.0, since no similarity score is boosted in any of
these cases.
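The reduction to the Jaro comparator can be checked directly. The sketch below assumes the comparator package is installed and that Jaro() is the constructor for the Jaro comparator referred to above. It uses max_prefix = 0L and, as a further case that follows from the formula under Details, p = 0; the boost term vanishes in both.

```r
library(comparator)

x <- "Matthew"
y <- "Martin"

# Plain Jaro similarity (assumes Jaro() constructs the Jaro comparator)
jaro <- Jaro()(x, y)

# Jaro-Winkler with the boost switched off in two different ways
no_prefix <- JaroWinkler(max_prefix = 0L)(x, y)
no_scaling <- JaroWinkler(p = 0)(x, y)

# Both comparisons are expected to agree with the Jaro similarity
isTRUE(all.equal(no_prefix, jaro))
isTRUE(all.equal(no_scaling, jaro))
```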
Examples
## Compare names
JaroWinkler()("Martha", "Mathra")
JaroWinkler()("Eileen", "Phyllis")
## Reduce the threshold for boosting
x <- "Matthew"
y <- "Martin"
JaroWinkler()(x, y) < JaroWinkler(threshold = 0.5)(x, y)