OSA {comparator} | R Documentation |
Optimal String Alignment (OSA) String/Sequence Comparator
Description
The Optimal String Alignment (OSA) distance between two strings/sequences
x
and y
is the minimum cost of operations (insertions,
deletions, substitutions or transpositions) required to transform x
into y
, subject to the constraint that no substring/subsequence is
edited more than once.
Usage
OSA(
deletion = 1,
insertion = 1,
substitution = 1,
transposition = 1,
normalize = FALSE,
similarity = FALSE,
ignore_case = FALSE,
use_bytes = FALSE
)
Arguments
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
substitution |
positive cost associated with substitution of a character or sequence element. Defaults to unit cost. |
transposition |
positive cost associated with transposing (swapping) a pair of characters or sequence elements. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
Details
For simplicity we assume x
and y
are strings in this section,
however the comparator is also implemented for more general sequences.
An OSA similarity is returned if similarity = TRUE
, which
is defined as
\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},
where |x|
, |y|
are the number of characters in x
and
y
respectively, dist
is the OSA distance, w_d
is the cost of a deletion and w_i
is the cost of an insertion.
Normalization of the OSA distance/similarity to the unit interval
is also supported by setting normalize = TRUE
. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion w_i
and deletion w_d
are equal.
The normalized distance \mathrm{dist}_n
is defined as
\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},
and the normalized similarity \mathrm{sim}_n
is defined as
\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.
Value
An OSA
instance is returned, which is an S4 class inheriting from
StringComparator
.
Note
If the costs of deletion and insertion are equal, this comparator is
symmetric in x
and y
. The OSA distance is not a proper metric
as it does not satisfy the triangle inequality. The Damerau-Levenshtein
distance is closely related—it allows the same edit operations as OSA,
but removes the requirement that no substring can be edited more than once.
References
Boytsov, L. (2011), "Indexing methods for approximate dictionary searching: Comparative analysis", ACM J. Exp. Algorithmics 16, Article 1.1.
Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29: 1091–1095.
See Also
Other edit-based comparators include Hamming
, LCS
,
Levenshtein
and DamerauLevenshtein
.
Examples
## Compare strings with a transposition error
x <- "plauge"; y <- "plague"
OSA()(x, y) != Levenshtein()(x, y)
## Unlike Damerau-Levenshtein, OSA does not allow a substring to be
## edited more than once
x <- "ABC"; y <- "CA"
OSA()(x, y) != DamerauLevenshtein()(x, y)
## Compare car names using normalized OSA similarity
data(mtcars)
cars <- rownames(mtcars)
pairwise(OSA(similarity = TRUE, normalize=TRUE), cars)