R: Optimal String Alignment (OSA) String/Sequence Comparator

OSA {comparator}

R Documentation

Optimal String Alignment (OSA) String/Sequence Comparator

Description

The Optimal String Alignment (OSA) distance between two strings/sequences x and y is the minimum cost of operations (insertions, deletions, substitutions or transpositions) required to transform x into y, subject to the constraint that no substring/subsequence is edited more than once.

Usage

OSA(
  deletion = 1,
  insertion = 1,
  substitution = 1,
  transposition = 1,
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)

Arguments

`deletion`	positive cost associated with deletion of a character or sequence element. Defaults to unit cost.
`insertion`	positive cost associated insertion of a character or sequence element. Defaults to unit cost.
`substitution`	positive cost associated with substitution of a character or sequence element. Defaults to unit cost.
`transposition`	positive cost associated with transposing (swapping) a pair of characters or sequence elements. Defaults to unit cost.
`normalize`	a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE.
`similarity`	a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE.
`ignore_case`	a logical. If TRUE, case is ignored when comparing strings.
`use_bytes`	a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.

Details

For simplicity we assume x and y are strings in this section, however the comparator is also implemented for more general sequences.

An OSA similarity is returned if similarity = TRUE, which is defined as

\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},

where |x|, |y| are the number of characters in x and y respectively, dist is the OSA distance, w_d is the cost of a deletion and w_i is the cost of an insertion.

Normalization of the OSA distance/similarity to the unit interval is also supported by setting normalize = TRUE. The normalization approach follows Yujian and Bo (2007), and ensures that the distance remains a metric when the costs of insertion w_i and deletion w_d are equal. The normalized distance \mathrm{dist}_n is defined as

\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},

and the normalized similarity \mathrm{sim}_n is defined as

\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.

Value

An OSA instance is returned, which is an S4 class inheriting from StringComparator.

Note

If the costs of deletion and insertion are equal, this comparator is symmetric in x and y. The OSA distance is not a proper metric as it does not satisfy the triangle inequality. The Damerau-Levenshtein distance is closely related—it allows the same edit operations as OSA, but removes the requirement that no substring can be edited more than once.

References

Boytsov, L. (2011), "Indexing methods for approximate dictionary searching: Comparative analysis", ACM J. Exp. Algorithmics 16, Article 1.1.

Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.

Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29: 1091–1095.

Examples

## Compare strings with a transposition error
x <- "plauge"; y <- "plague"
OSA()(x, y) != Levenshtein()(x, y)

## Unlike Damerau-Levenshtein, OSA does not allow a substring to be 
## edited more than once
x <- "ABC"; y <- "CA"
OSA()(x, y) != DamerauLevenshtein()(x, y)

## Compare car names using normalized OSA similarity
data(mtcars)
cars <- rownames(mtcars)
pairwise(OSA(similarity = TRUE, normalize=TRUE), cars)

[Package comparator version 0.1.2 Index]