adist {utils} | R Documentation |
Approximate String Distances
Description
Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.
Usage
adist(x, y = NULL, costs = NULL, counts = FALSE, fixed = TRUE,
partial = !fixed, ignore.case = FALSE, useBytes = FALSE)
Arguments
x |
a character vector. Long vectors are not supported. |
y |
a character vector, or |
costs |
a numeric vector or list with names partially matching
‘insertions’, ‘deletions’ and ‘substitutions’ giving
the respective costs for computing the Levenshtein distance, or
|
counts |
a logical indicating whether to optionally return the
transformation counts (numbers of insertions, deletions and
substitutions) as the |
fixed |
a logical. If |
partial |
a logical indicating whether the transformed |
ignore.case |
a logical. If |
useBytes |
a logical. If |
Details
The (generalized) Levenshtein (or edit) distance between two strings
s and t is the minimal possibly weighted number of
insertions, deletions and substitutions needed to transform s
into t (so that the transformation exactly matches t).
This distance is computed for partial = FALSE
, currently using
a dynamic programming algorithm (see, e.g.,
https://en.wikipedia.org/wiki/Levenshtein_distance) with space
and time complexity O(mn)
, where m
and n
are the
lengths of s and t, respectively. Additionally computing
the transformation sequence and counts is O(\max(m, n))
.
The generalized Levenshtein distance can also be used for approximate
(fuzzy) string matching, in which case one finds the substring of
t with minimal distance to the pattern s (which could be
taken as a regular expression, in which case the principle of using
the leftmost and longest match applies), see, e.g.,
https://en.wikipedia.org/wiki/Approximate_string_matching. This
distance is computed for partial = TRUE
using ‘tre’ by
Ville Laurikari (https://github.com/laurikari/tre) and
corresponds to the distance used by agrep
. In this
case, the given cost values are coerced to integer.
Note that the costs for insertions and deletions can be different, in which case the distance between s and t can be different from the distance between t and s.
Value
A matrix with the approximate string distances of the elements of
x
and y
, with rows and columns corresponding to x
and y
, respectively.
If counts
is TRUE
, the transformation counts are
returned as the "counts"
attribute of this matrix, as a
3-dimensional array with dimensions corresponding to the elements of
x
, the elements of y
, and the type of transformation
(insertions, deletions and substitutions), respectively.
Additionally, if partial = FALSE
, the transformation sequences
are returned as the "trafos"
attribute of the return value, as
character strings with elements ‘M’, ‘I’, ‘D’ and
‘S’ indicating a match, insertion, deletion and substitution,
respectively. If partial = TRUE
, the offsets (positions of
the first and last element) of the matched substrings are returned as
the "offsets"
attribute of the return value (with both offsets
-1
in case of no match).
See Also
agrep
for approximate string matching (fuzzy matching)
using the generalized Levenshtein distance.
Examples
## Cf. https://en.wikipedia.org/wiki/Levenshtein_distance
adist("kitten", "sitting")
## To see the transformation counts for the Levenshtein distance:
drop(attr(adist("kitten", "sitting", counts = TRUE), "counts"))
## To see the transformation sequences:
attr(adist(c("kitten", "sitting"), counts = TRUE), "trafos")
## Cf. the examples for agrep:
adist("lasy", "1 lazy 2")
## For a "partial approximate match" (as used for agrep):
adist("lasy", "1 lazy 2", partial = TRUE)