LCS {comparator} | R Documentation |
Longest Common Subsequence (LCS) Comparator
Description
The Longest Common Subsequence (LCS) distance between two
strings/sequences and
is the minimum cost of operations
(insertions and deletions) required to transform
into
.
The LCS similarity is more commonly used, which can be interpreted as the
length of the longest subsequence common to
and
.
Usage
LCS(
deletion = 1,
insertion = 1,
normalize = FALSE,
similarity = FALSE,
ignore_case = FALSE,
use_bytes = FALSE
)
Arguments
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
Details
For simplicity we assume x
and y
are strings in this section,
however the comparator is also implemented for more general sequences.
An LCS similarity is returned if similarity = TRUE
, which
is defined as
where ,
are the number of characters in
and
respectively,
is the LCS distance,
is the cost of a deletion and
is the cost of an insertion.
Normalization of the LCS distance/similarity to the unit interval
is also supported by setting normalize = TRUE
. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion and deletion
are equal.
The normalized distance
is defined as
and the normalized similarity is defined as
Value
A LCS
instance is returned, which is an S4 class inheriting from
StringComparator
.
Note
If the costs of deletion and insertion are equal, this comparator is
symmetric in and
. In addition, the normalized and
unnormalized distances satisfy the properties of a metric.
References
Bergroth, L., Hakonen, H., & Raita, T. (2000), "A survey of longest common subsequence algorithms", Proceedings Seventh International Symposium on String Processing and Information Retrieval (SPIRE'00), 39-48.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091–1095.
See Also
Other edit-based comparators include Hamming
, Levenshtein
,
OSA
and DamerauLevenshtein
.
Examples
## There are no common substrings of size 3 for the following example,
## however there are two common substrings of size 2: "AC" and "BC".
## Hence the LCS similarity is 2.
x <- "ABCDA"; y <- "BAC"
LCS(similarity = TRUE)(x, y)
## Levenshtein distance reduces to LCS distance when the cost of
## substitution is high
x <- "ABC"; y <- "AAA"
LCS()(x, y) == Levenshtein(substitution = 100)(x, y)