Jaro {comparator} | R Documentation
Jaro String/Sequence Comparator
Description
Compares a pair of strings/sequences x and y based on the number of greedily-aligned characters/sequence elements and the number of transpositions. It was developed for comparing names at the U.S. Census Bureau.
Usage
Jaro(similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE)
Arguments
similarity | a logical. If TRUE, similarity scores are returned (default); otherwise distances are returned (see the definition under Details).
ignore_case | a logical. If TRUE, case is ignored when comparing strings.
use_bytes | a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.
Details
For simplicity, we assume x and y are strings in this section; however, the comparator is also implemented for more general sequences.
When similarity = TRUE (default), the Jaro similarity is computed as
\mathrm{sim}(x, y) = \frac{1}{3}\left(\frac{m}{|x|} + \frac{m}{|y|} + \frac{m - \lfloor \frac{t}{2} \rfloor}{m}\right)
where m is the number of "matching" characters (defined below), t is the number of "transpositions", and |x|, |y| are the lengths of the strings x and y. The similarity takes on values in the range [0, 1], where 1 corresponds to a perfect match.
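As a worked illustration of the formula, suppose the alignment yields m = 6 and t = 3 for two strings of length 6. These are the counts obtained by hand for "Martha" and "Mathra" under the definitions below; they are an illustration, not output from the package.

```latex
\mathrm{sim}(x, y)
  = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - \lfloor \frac{3}{2} \rfloor}{6}\right)
  = \frac{1}{3}\left(1 + 1 + \frac{5}{6}\right)
  = \frac{17}{18} \approx 0.944
```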
The number of "matching" characters m is computed using a greedy alignment algorithm. The algorithm iterates over the characters in x, attempting to align the i-th character x_i with the first matching character in y. When looking for matching characters in y, the algorithm only considers previously unmatched characters within a window
[\max(0, i - w), \min(|y|, i + w)]
where w = \left\lfloor \frac{\max(|x|, |y|)}{2} \right\rfloor - 1.
The alignment process yields a subsequence of matching characters from each of x and y. The number of "transpositions" t is defined to be the number of positions in the matched subsequence of x whose character differs from the character at the corresponding position in the matched subsequence of y.
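The matching and transposition steps described above can be sketched in base R. This is an illustrative reading of the description, not the package's implementation: jaro_sketch is a hypothetical helper, the window is translated to 1-based indices, w is clamped at 0 so that very short strings can still match, and the similarity is taken to be 0 when there are no matches. The last three choices are assumptions.

```r
# Sketch of the Jaro similarity for two single strings (base R only).
# jaro_sketch is a hypothetical name, not part of the comparator package.
jaro_sketch <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  nx <- length(xs)
  ny <- length(ys)
  if (nx == 0 || ny == 0) return(0)  # assumption: nothing to match

  # Window half-width; clamped at 0 (assumption) for very short strings
  w <- max(0, floor(max(nx, ny) / 2) - 1)

  x_used <- logical(nx)  # TRUE where x[i] has been aligned
  y_used <- logical(ny)  # TRUE where y[j] has been aligned

  # Greedy alignment: take the first unmatched, equal character in the window
  for (i in seq_len(nx)) {
    lo <- max(1, i - w)
    hi <- min(ny, i + w)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_used[j] && xs[i] == ys[j]) {
        x_used[i] <- TRUE
        y_used[j] <- TRUE
        break
      }
    }
  }

  m <- sum(x_used)
  if (m == 0) return(0)  # assumption: similarity is 0 without matches

  # Matched characters in order of appearance in each string;
  # t counts the positions at which the two subsequences differ
  t <- sum(xs[x_used] != ys[y_used])

  (m / nx + m / ny + (m - floor(t / 2)) / m) / 3
}

jaro_sketch("Martha", "Mathra")  # 17/18, about 0.944, under the assumptions above
```

For "Martha" and "Mathra" the sketch matches all six characters (m = 6) and finds three misaligned positions (t = 3), so one transposition pair is subtracted in the last term.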
When similarity = FALSE, the Jaro distance is computed as
\mathrm{dist}(x,y) = 1 - \mathrm{sim}(x,y).
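The relationship between the two modes can be checked with a short snippet. It assumes the comparator package is installed and uses only the constructor arguments listed under Usage.

```r
library(comparator)  # assumes the package is installed

sim  <- Jaro()("Martha", "Mathra")                    # similarity (default)
dist <- Jaro(similarity = FALSE)("Martha", "Mathra")  # distance

# By the definition above, dist should equal 1 - sim up to floating-point error
all.equal(dist, 1 - sim)
```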
Value
A Jaro instance is returned. Jaro is an S4 class inheriting from StringComparator.
Note
The Jaro distance is not a metric, as it does not satisfy the
identity axiom \mathrm{dist}(x,y) = 0 \Leftrightarrow x = y.
References
Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.
See Also
The JaroWinkler comparator modifies the Jaro comparator by boosting the similarity score for strings/sequences that have matching prefixes.
Examples
## Compare names
Jaro()("Martha", "Mathra")
Jaro()("Eileen", "Phyllis")