OSA {comparator} | R Documentation |

## Optimal String Alignment (OSA) String/Sequence Comparator

### Description

The Optimal String Alignment (OSA) distance between two strings/sequences
`x`

and `y`

is the minimum cost of operations (insertions,
deletions, substitutions or transpositions) required to transform `x`

into `y`

, subject to the constraint that *no substring/subsequence is
edited more than once*.

### Usage

```
OSA(
deletion = 1,
insertion = 1,
substitution = 1,
transposition = 1,
normalize = FALSE,
similarity = FALSE,
ignore_case = FALSE,
use_bytes = FALSE
)
```

### Arguments

`deletion` |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |

`insertion` |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |

`substitution` |
positive cost associated with substitution of a character or sequence element. Defaults to unit cost. |

`transposition` |
positive cost associated with transposing (swapping) a pair of characters or sequence elements. Defaults to unit cost. |

`normalize` |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |

`similarity` |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |

`ignore_case` |
a logical. If TRUE, case is ignored when comparing strings. |

`use_bytes` |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |

### Details

For simplicity we assume `x`

and `y`

are strings in this section,
however the comparator is also implemented for more general sequences.

An OSA similarity is returned if `similarity = TRUE`

, which
is defined as

`\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},`

where `|x|`

, `|y|`

are the number of characters in `x`

and
`y`

respectively, `dist`

is the OSA distance, `w_d`

is the cost of a deletion and `w_i`

is the cost of an insertion.

Normalization of the OSA distance/similarity to the unit interval
is also supported by setting `normalize = TRUE`

. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion `w_i`

and deletion `w_d`

are equal.
The normalized distance `\mathrm{dist}_n`

is defined as

`\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},`

and the normalized similarity `\mathrm{sim}_n`

is defined as

`\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.`

### Value

An `OSA`

instance is returned, which is an S4 class inheriting from
`StringComparator`

.

### Note

If the costs of deletion and insertion are equal, this comparator is
symmetric in `x`

and `y`

. The OSA distance is not a proper metric
as it does not satisfy the triangle inequality. The Damerau-Levenshtein
distance is closely related—it allows the same edit operations as OSA,
but removes the requirement that no substring can be edited more than once.

### References

Boytsov, L. (2011), "Indexing methods for approximate dictionary searching:
Comparative analysis", *ACM J. Exp. Algorithmics* **16**,
Article 1.1.

Navarro, G. (2001), "A guided tour to approximate string matching",
*ACM Computing Surveys (CSUR)*, **33**(1), 31-88.

Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric",
*IEEE Transactions on Pattern Analysis and Machine Intelligence*
**29**: 1091–1095.

### See Also

Other edit-based comparators include `Hamming`

, `LCS`

,
`Levenshtein`

and `DamerauLevenshtein`

.

### Examples

```
## Compare strings with a transposition error
x <- "plauge"; y <- "plague"
OSA()(x, y) != Levenshtein()(x, y)
## Unlike Damerau-Levenshtein, OSA does not allow a substring to be
## edited more than once
x <- "ABC"; y <- "CA"
OSA()(x, y) != DamerauLevenshtein()(x, y)
## Compare car names using normalized OSA similarity
data(mtcars)
cars <- rownames(mtcars)
pairwise(OSA(similarity = TRUE, normalize=TRUE), cars)
```

*comparator*version 0.1.2 Index]