Jaro {comparator}    R Documentation

Jaro String/Sequence Comparator

Description

Compares a pair of strings/sequences x and y based on the number of greedily-aligned characters/sequence elements and the number of transpositions. It was developed for comparing names at the U.S. Census Bureau.

Usage

Jaro(similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE)

Arguments

similarity

a logical. If TRUE (default), similarity scores are returned; otherwise, distances are returned (see the definitions under Details).

ignore_case

a logical. If TRUE, case is ignored when comparing strings.

use_bytes

a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.

Details

For simplicity, this section assumes that x and y are strings; the comparator is also implemented for more general sequences.

When similarity = TRUE (default), the Jaro similarity is computed as

\mathrm{sim}(x, y) = \frac{1}{3}\left(\frac{m}{|x|} + \frac{m}{|y|} + \frac{m - \lfloor \frac{t}{2} \rfloor}{m}\right)

where m is the number of "matching" characters (defined below), t is the number of "transpositions", and |x|, |y| are the lengths of the strings x and y. The similarity takes on values in the range [0, 1], where 1 corresponds to a perfect match.
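As a quick arithmetic check of the formula above, the following base R sketch evaluates the similarity from given counts. The helper name jaro_from_counts and the example counts are hypothetical, chosen only for illustration, and returning 0 when m = 0 is an assumption made here to avoid dividing by zero; the text above does not state how that case is handled.

```r
# Hypothetical helper: evaluate the Jaro similarity formula from its inputs.
#   m  = number of matching characters
#   t  = number of transpositions (misaligned positions)
#   nx = |x|, ny = |y|
jaro_from_counts <- function(m, t, nx, ny) {
  # Assumption: with no matching characters the similarity is taken to be 0.
  if (m == 0) return(0)
  (m / nx + m / ny + (m - floor(t / 2)) / m) / 3
}

# Illustrative counts: m = 4, t = 2, |x| = 5, |y| = 6
# (4/5 + 4/6 + (4 - 1)/4) / 3 = 133/180
jaro_from_counts(m = 4, t = 2, nx = 5, ny = 6)
```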

The number of "matching" characters m is computed using a greedy alignment algorithm. The algorithm iterates over the characters in x, attempting to align the i-th character x_i with the first matching character in y. When looking for matching characters in y, the algorithm only considers previously un-matched characters within a window [\max(0, i - w), \min(|y|, i + w)] where w = \left\lfloor \frac{\max(|x|, |y|)}{2} \right\rfloor - 1. The alignment process yields a subsequence of matching characters from x and y. The number of "transpositions" t is defined to be the number of positions in the subsequence of x which are misaligned with the corresponding position in y.
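The greedy alignment described above can be sketched in base R as follows. This is an illustrative reading of the description, not the package's implementation. The function name jaro_sketch is hypothetical, indices are 1-based as usual in R, and three choices are assumptions made here: inputs must be non-empty, the window half-width w is clamped at 0 (the formula gives -1 for single-character strings), and the similarity is taken to be 0 when no characters match.

```r
# Hypothetical base R sketch of the Jaro similarity for two non-empty strings.
jaro_sketch <- function(x, y) {
  stopifnot(nchar(x) > 0, nchar(y) > 0)  # assumption: non-empty inputs only
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  nx <- length(xs)
  ny <- length(ys)

  # Window half-width; clamped at 0 (an assumption of this sketch).
  w <- max(0, floor(max(nx, ny) / 2) - 1)

  x_matched <- logical(nx)
  y_matched <- logical(ny)

  # Greedy alignment: match x[i] to the first unmatched equal character
  # of y inside the window around position i.
  for (i in seq_len(nx)) {
    lo <- max(1, i - w)
    hi <- min(ny, i + w)
    if (lo > hi) next  # window lies entirely beyond the end of y
    for (j in lo:hi) {
      if (!y_matched[j] && xs[i] == ys[j]) {
        x_matched[i] <- TRUE
        y_matched[j] <- TRUE
        break
      }
    }
  }

  m <- sum(x_matched)
  if (m == 0) return(0)  # assumption: no matches means similarity 0

  # Transpositions: positions where the two matched subsequences disagree.
  t <- sum(xs[x_matched] != ys[y_matched])

  (m / nx + m / ny + (m - floor(t / 2)) / m) / 3
}

jaro_sketch("MARTHA", "MARHTA")  # m = 6, t = 2, giving 17/18 (about 0.944)
```

For "MARTHA" and "MARHTA" all six characters match, and the matched subsequences differ in two positions (T and H are swapped), so the sketch evaluates (1 + 1 + 5/6) / 3.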

When similarity = FALSE, the Jaro distance is computed as

\mathrm{dist}(x, y) = 1 - \mathrm{sim}(x, y).

Value

A Jaro instance is returned, which is an S4 class inheriting from StringComparator.

Note

The Jaro distance is not a metric, as it does not satisfy the identity axiom \mathrm{dist}(x, y) = 0 \Leftrightarrow x = y.

References

Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.

See Also

The JaroWinkler comparator modifies the Jaro comparator by boosting the similarity score for strings/sequences that have matching prefixes.

Examples

## Compare names
Jaro()("Martha", "Mathra")
Jaro()("Eileen", "Phyllis")


[Package comparator version 0.1.2 Index]