Hamming {comparator} | R Documentation |
Hamming String/Sequence Comparator
Description
The Hamming distance between two strings/sequences of equal length is the number of positions where the corresponding characters/sequence elements differ. It can be viewed as a type of edit distance where the only permitted operation is substitution of characters/sequence elements.
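As an illustration of the definition above, here is a minimal base-R sketch. The helper name hamming_dist is hypothetical and not part of the package; it assumes two single strings of equal length and compares them character by character.

```r
# Hypothetical helper (not part of the comparator package): counts the
# positions at which two equal-length strings differ.
hamming_dist <- function(x, y) {
  xs <- strsplit(x, "")[[1]]  # split each string into single characters
  ys <- strsplit(y, "")[[1]]
  stopifnot(length(xs) == length(ys))  # this sketch covers equal lengths only
  sum(xs != ys)                        # each mismatch is one substitution
}

hamming_dist("karolin", "kathrin")  # 3: positions 3, 4 and 5 differ
```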
Usage
Hamming(
normalize = FALSE,
similarity = FALSE,
ignore_case = FALSE,
use_bytes = FALSE
)
Arguments
normalize |
a logical. If TRUE, distances/similarities are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. Defaults to FALSE. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. Defaults to FALSE. |
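The difference between character-wise and byte-wise comparison matters for multi-byte encodings. The base-R sketch below only illustrates why the two views can disagree about string length; it does not call the package, and the variable names are chosen for illustration.

```r
x <- "caf\u00e9"  # "café": 4 characters, 5 bytes in UTF-8
y <- "cafe"       # 4 characters, 4 bytes

# Character-by-character view: equal lengths, one differing position
chars_x <- strsplit(x, "")[[1]]
chars_y <- strsplit(y, "")[[1]]
n_char_mismatch <- sum(chars_x != chars_y)

# Byte-by-byte view: the lengths differ, so the two strings no longer
# qualify as "equal length" in the sense of the Hamming distance
bytes_x <- charToRaw(x)
bytes_y <- charToRaw(y)
same_byte_length <- length(bytes_x) == length(bytes_y)

# Case-insensitive view: fold case before comparing
case_equal <- tolower("ZIP") == tolower("zip")
```

Under these assumptions, a byte-wise comparison would treat x and y as having different lengths, while a character-wise comparison would see a single substitution.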
Details
When the input strings/sequences $x$ and $y$ are of different lengths ($|x| \neq |y|$), the Hamming distance is defined to be $\infty$.
A Hamming similarity is returned if similarity = TRUE. When $|x| = |y|$ the similarity is defined as follows:
$$\mathrm{sim}(x, y) = |x| - \mathrm{dist}(x, y),$$
where $\mathrm{sim}$ is the Hamming similarity and $\mathrm{dist}$ is the Hamming distance. When $|x| \neq |y|$ the similarity is defined to be 0.
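The similarity rule can be sketched in base R as follows. The helper hamming_sim is a hypothetical name used only for illustration; it is not the package's implementation and handles a single pair of strings.

```r
# Hypothetical helper: Hamming similarity as defined in the Details.
hamming_sim <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  if (length(xs) != length(ys)) {
    return(0)                      # unequal lengths: similarity is 0
  }
  length(xs) - sum(xs != ys)       # sim = |x| - dist
}

hamming_sim("90001", "90209")  # 5 - 2 = 3
hamming_sim("90001", "9000")   # 0, since the lengths differ
```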
Normalization of the Hamming distance/similarity to the unit interval is also supported by setting normalize = TRUE. The raw distance/similarity is divided by the length of the string/sequence $|x| = |y|$. If $|x| \neq |y|$ the normalized distance is defined to be 1, while the normalized similarity is defined to be 0.
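A minimal base-R sketch of the normalization rule is given below. The helper hamming_norm and its similarity argument are hypothetical and only mirror the definitions in this section; empty strings, for which the division is undefined, are not handled.

```r
# Hypothetical helper: normalized Hamming distance or similarity.
hamming_norm <- function(x, y, similarity = FALSE) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  if (length(xs) != length(ys)) {
    # unequal lengths: distance is 1, similarity is 0
    return(if (similarity) 0 else 1)
  }
  d <- sum(xs != ys)                             # raw distance
  raw <- if (similarity) length(xs) - d else d   # raw similarity or distance
  raw / length(xs)                               # divide by |x| = |y|
}

hamming_norm("90001", "90209")                     # 2 / 5 = 0.4
hamming_norm("90001", "90209", similarity = TRUE)  # 3 / 5 = 0.6
hamming_norm("90001", "9000")                      # 1
```

For strings of equal length, the normalized distance and normalized similarity sum to 1 in this sketch.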
Value
A Hamming instance is returned, which is an S4 class inheriting from StringComparator.
Note
While the unnormalized Hamming distance is a metric, the normalized variant is not as it does not satisfy the triangle inequality.
See Also
Other edit-based comparators include LCS, Levenshtein, OSA and DamerauLevenshtein.
Examples
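## Compare US ZIP codes
library(comparator)
x <- "90001"
y <- "90209"
m1 <- Hamming() # unnormalized distance
m2 <- Hamming(similarity = TRUE, normalize = TRUE) # normalized similarity
m1(x, y) # the strings differ at positions 3 and 5
m2(x, y) # (5 - 2) / 5 by the definitions in Details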