R: Longest Common Subsequence (LCS) Comparator

LCS {comparator}

R Documentation

Longest Common Subsequence (LCS) Comparator

Description

The Longest Common Subsequence (LCS) distance between two strings/sequences x and y is the minimum cost of operations (insertions and deletions) required to transform x into y. The LCS similarity is more commonly used, which can be interpreted as the length of the longest subsequence common to x and y.

Usage

LCS(
  deletion = 1,
  insertion = 1,
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)

Arguments

`deletion`	positive cost associated with deletion of a character or sequence element. Defaults to unit cost.
`insertion`	positive cost associated insertion of a character or sequence element. Defaults to unit cost.
`normalize`	a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE.
`similarity`	a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE.
`ignore_case`	a logical. If TRUE, case is ignored when comparing strings.
`use_bytes`	a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.

Details

For simplicity we assume x and y are strings in this section, however the comparator is also implemented for more general sequences.

An LCS similarity is returned if similarity = TRUE, which is defined as

\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},

where |x|, |y| are the number of characters in x and y respectively, dist is the LCS distance, w_d is the cost of a deletion and w_i is the cost of an insertion.

Normalization of the LCS distance/similarity to the unit interval is also supported by setting normalize = TRUE. The normalization approach follows Yujian and Bo (2007), and ensures that the distance remains a metric when the costs of insertion w_i and deletion w_d are equal. The normalized distance \mathrm{dist}_n is defined as

\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},

and the normalized similarity \mathrm{sim}_n is defined as

\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.

Value

A LCS instance is returned, which is an S4 class inheriting from StringComparator.

Note

If the costs of deletion and insertion are equal, this comparator is symmetric in x and y. In addition, the normalized and unnormalized distances satisfy the properties of a metric.

References

Bergroth, L., Hakonen, H., & Raita, T. (2000), "A survey of longest common subsequence algorithms", Proceedings Seventh International Symposium on String Processing and Information Retrieval (SPIRE'00), 39-48.

Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091–1095.

Examples

## There are no common substrings of size 3 for the following example, 
## however there are two common substrings of size 2: "AC" and "BC". 
## Hence the LCS similarity is 2.
x <- "ABCDA"; y <- "BAC"
LCS(similarity = TRUE)(x, y)

## Levenshtein distance reduces to LCS distance when the cost of 
## substitution is high
x <- "ABC"; y <- "AAA"
LCS()(x, y) == Levenshtein(substitution = 100)(x, y)

[Package comparator version 0.1.2 Index]