DamerauLevenshtein {comparator} | R Documentation |
Damerau-Levenshtein String/Sequence Comparator
Description
The Damerau-Levenshtein distance between two strings/sequences x
and y
is the minimum cost of operations (insertions, deletions,
substitutions or transpositions) required to transform x
into y
. It differs from the Levenshtein distance by including
transpositions (swaps) among the allowable operations.
Usage
DamerauLevenshtein(
deletion = 1,
insertion = 1,
substitution = 1,
transposition = 1,
normalize = FALSE,
similarity = FALSE,
ignore_case = FALSE,
use_bytes = FALSE
)
Arguments
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
substitution |
positive cost associated with substitution of a character or sequence element. Defaults to unit cost. |
transposition |
positive cost associated with transposing (swapping) a pair of characters or sequence elements. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
Details
For simplicity we assume x
and y
are strings in this section,
however the comparator is also implemented for more general sequences.
A Damerau-Levenshtein similarity is returned if similarity = TRUE
, which
is defined as
\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},
where |x|
, |y|
are the number of characters in x
and
y
respectively, \mathrm{dist}
is the Damerau-Levenshtein
distance, w_d
is the cost of a deletion and w_i
is the cost of
an insertion.
Normalization of the Damerau-Levenshtein distance/similarity to the unit
interval is also supported by setting normalize = TRUE
. The normalization
approach follows Yujian and Bo (2007), and ensures that the distance
remains a metric when the costs of insertion w_i
and deletion
w_d
are equal. The normalized distance \mathrm{dist}_n
is defined as
\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},
and the normalized similarity \mathrm{sim}_n
is defined as
\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.
Value
A DamerauLevenshtein
instance is returned, which is an S4 class inheriting
from Levenshtein
.
Note
If the costs of deletion and insertion are equal, this comparator is
symmetric in x
and y
. In addition, the normalized and
unnormalized distances satisfy the properties of a metric.
References
Boytsov, L. (2011), "Indexing methods for approximate dictionary searching: Comparative analysis", ACM J. Exp. Algorithmics 16, Article 1.1.
Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091-1095.
See Also
Other edit-based comparators include Hamming
, LCS
,
Levenshtein
and OSA
.
Examples
## The Damerau-Levenshtein distance reduces to ordinary Levenshtein distance
## when the cost of transpositions is high
x <- "plauge"; y <- "plague"
DamerauLevenshtein(transposition = 100)(x, y) == Levenshtein()(x, y)
## Compare car names using normalized Damerau-Levenshtein similarity
data(mtcars)
cars <- rownames(mtcars)
pairwise(DamerauLevenshtein(similarity = TRUE, normalize=TRUE), cars)
## Compare sequences using Damerau-Levenshtein distance
x <- c("G", "T", "G", "C", "T", "G", "G", "C", "C", "C", "A", "T")
y <- c("G", "T", "G", "C", "G", "T", "G", "C", "C", "C", "A", "T")
DamerauLevenshtein()(list(x), list(y))