DamerauLevenshtein {comparator}R Documentation

Damerau-Levenshtein String/Sequence Comparator

Description

The Damerau-Levenshtein distance between two strings/sequences xx and yy is the minimum cost of operations (insertions, deletions, substitutions or transpositions) required to transform xx into yy. It differs from the Levenshtein distance by including transpositions (swaps) among the allowable operations.

Usage

DamerauLevenshtein(
  deletion = 1,
  insertion = 1,
  substitution = 1,
  transposition = 1,
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)

Arguments

deletion

positive cost associated with deletion of a character or sequence element. Defaults to unit cost.

insertion

positive cost associated insertion of a character or sequence element. Defaults to unit cost.

substitution

positive cost associated with substitution of a character or sequence element. Defaults to unit cost.

transposition

positive cost associated with transposing (swapping) a pair of characters or sequence elements. Defaults to unit cost.

normalize

a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE.

similarity

a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE.

ignore_case

a logical. If TRUE, case is ignored when comparing strings.

use_bytes

a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.

Details

For simplicity we assume x and y are strings in this section, however the comparator is also implemented for more general sequences.

A Damerau-Levenshtein similarity is returned if similarity = TRUE, which is defined as

sim(x,y)=wdx+wiydist(x,y)2,\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},

where x|x|, y|y| are the number of characters in xx and yy respectively, dist\mathrm{dist} is the Damerau-Levenshtein distance, wdw_d is the cost of a deletion and wiw_i is the cost of an insertion.

Normalization of the Damerau-Levenshtein distance/similarity to the unit interval is also supported by setting normalize = TRUE. The normalization approach follows Yujian and Bo (2007), and ensures that the distance remains a metric when the costs of insertion wiw_i and deletion wdw_d are equal. The normalized distance distn\mathrm{dist}_n is defined as

distn(x,y)=2dist(x,y)wdx+wiy+dist(x,y),\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},

and the normalized similarity simn\mathrm{sim}_n is defined as

simn(x,y)=1distn(x,y)=sim(x,y)wdx+wiysim(x,y).\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.

Value

A DamerauLevenshtein instance is returned, which is an S4 class inheriting from Levenshtein.

Note

If the costs of deletion and insertion are equal, this comparator is symmetric in xx and yy. In addition, the normalized and unnormalized distances satisfy the properties of a metric.

References

Boytsov, L. (2011), "Indexing methods for approximate dictionary searching: Comparative analysis", ACM J. Exp. Algorithmics 16, Article 1.1.

Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.

Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091-1095.

See Also

Other edit-based comparators include Hamming, LCS, Levenshtein and OSA.

Examples

## The Damerau-Levenshtein distance reduces to ordinary Levenshtein distance 
## when the cost of transpositions is high
x <- "plauge"; y <- "plague"
DamerauLevenshtein(transposition = 100)(x, y) == Levenshtein()(x, y)

## Compare car names using normalized Damerau-Levenshtein similarity
data(mtcars)
cars <- rownames(mtcars)
pairwise(DamerauLevenshtein(similarity = TRUE, normalize=TRUE), cars)

## Compare sequences using Damerau-Levenshtein distance
x <- c("G", "T", "G", "C", "T", "G", "G", "C", "C", "C", "A", "T")
y <- c("G", "T", "G", "C", "G", "T", "G", "C", "C", "C", "A", "T")
DamerauLevenshtein()(list(x), list(y))


[Package comparator version 0.1.2 Index]