Hamming {comparator}    R Documentation

Hamming String/Sequence Comparator

Description

The Hamming distance between two strings/sequences of equal length is the number of positions where the corresponding characters/sequence elements differ. It can be viewed as a type of edit distance where the only permitted operation is substitution of characters/sequence elements.
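As a rough illustration of this definition, the count of differing positions can be sketched in base R. The helper name hamming_sketch is hypothetical; it is not part of comparator and says nothing about how the package implements the comparison.

```r
# Hypothetical helper (not part of the comparator package): counts the
# positions at which two equal-length strings differ.
hamming_sketch <- function(x, y) {
  # Split each string into a vector of single characters
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  # This sketch only covers the equal-length case
  stopifnot(length(xs) == length(ys))
  # Each mismatching position corresponds to one substitution
  sum(xs != ys)
}

hamming_sketch("karolin", "kathrin")  # 3: positions 3, 4 and 5 differ
```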

Usage

Hamming(
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)

Arguments

normalize

a logical. If TRUE, distances/similarities are normalized to the unit interval. Defaults to FALSE.

similarity

a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE.

ignore_case

a logical. If TRUE, case is ignored when comparing strings. Defaults to FALSE.

use_bytes

a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. Defaults to FALSE.
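The distinction that use_bytes refers to can be seen with base R alone. The snippet below is only an illustration of characters versus bytes in a UTF-8 string; it does not call comparator and makes no claim about how the package processes its inputs internally.

```r
# "\u00e9" is e-acute: one character, but two bytes in UTF-8
x <- "caf\u00e9"
y <- "cafe"

n_chars_x <- nchar(x, type = "chars")  # length in characters
n_bytes_x <- nchar(x, type = "bytes")  # length in bytes
n_bytes_y <- nchar(y, type = "bytes")  # plain ASCII: one byte per character

# Character-wise, x and y have the same length; byte-wise they do not
c(chars = n_chars_x, bytes = n_bytes_x)
n_bytes_x == n_bytes_y
```

Because a length mismatch changes the result of a Hamming comparison, the choice between bytes and characters can matter for non-ASCII input.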

Details

When the input strings/sequences x and y are of different lengths (|x| \neq |y|), the Hamming distance is defined to be \infty.

A Hamming similarity is returned if similarity = TRUE. When |x| = |y| the similarity is defined as follows:

\mathrm{sim}(x, y) = |x| - \mathrm{dist}(x, y),

where sim is the Hamming similarity and dist is the Hamming distance. When |x| \neq |y| the similarity is defined to be 0.

Normalization of the Hamming distance/similarity to the unit interval is also supported by setting normalize = TRUE. When |x| = |y|, the raw distance/similarity is divided by the common length |x| of the strings/sequences. If |x| \neq |y|, the normalized distance is defined to be 1, while the normalized similarity is defined to be 0.
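The rules in this section can be collected into a short base R sketch. The function name hamming_details_sketch and its body are assumptions made for illustration; they follow the definitions above but are not the package's code, and the case of two empty strings is left out because the text does not say how it is normalized.

```r
# Hypothetical sketch of the definitions above (not the comparator code).
# Two empty strings are not handled: normalization would divide by zero.
hamming_details_sketch <- function(x, y, normalize = FALSE, similarity = FALSE) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  n <- length(xs)

  if (n != length(ys)) {
    # Unequal lengths: similarity is 0; distance is Inf, or 1 if normalized
    if (similarity) return(0)
    return(if (normalize) 1 else Inf)
  }

  dist <- sum(xs != ys)                        # raw Hamming distance
  score <- if (similarity) n - dist else dist  # sim(x, y) = |x| - dist(x, y)
  if (normalize) score / n else score          # divide by the common length
}

hamming_details_sketch("90001", "90209")                     # raw distance
hamming_details_sketch("90001", "90209", normalize = TRUE,
                       similarity = TRUE)                    # normalized similarity
hamming_details_sketch("abc", "abcd")                        # unequal lengths
```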

Value

A Hamming instance is returned, which is an S4 class inheriting from StringComparator.

Note

While the unnormalized Hamming distance is a metric, the normalized variant is not, as it does not satisfy the triangle inequality.

See Also

Other edit-based comparators include LCS, Levenshtein, OSA and DamerauLevenshtein.

Examples

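## Compare US ZIP codes
library(comparator)
x <- "90001"
y <- "90209"
m1 <- Hamming()                                     # unnormalized distance
m2 <- Hamming(similarity = TRUE, normalize = TRUE)  # normalized similarity
m1(x, y)  # positions 3 and 5 differ, so by the definition the distance is 2
m2(x, y)  # by the definition, (5 - 2) / 5 = 0.6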


[Package comparator version 0.1.2 Index]