| seq_dist {stringdist} | R Documentation |
Compute distance metrics between integer sequences
Description
seq_dist computes pairwise distances between the integer sequences in
a and b, where the argument with fewer elements is recycled.
seq_distmatrix computes the distance matrix with rows according to
a and columns according to b.
Usage
seq_dist(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
nthread = getOption("sd_num_thread")
)
seq_distmatrix(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
useNames = c("names", "none"),
nthread = getOption("sd_num_thread")
)
Arguments
a |
( |
b |
( |
method |
Distance metric. See |
weight |
For |
q |
Size of the |
p |
Prefix factor for Jaro-Winkler distance. The valid range for
|
bt |
Winkler's boost threshold. Winkler's prefix factor is
only applied when the Jaro distance is larger than |
nthread |
Maximum number of threads to use. By default, a sensible
number of threads is chosen, see |
useNames |
label the output matrix with |
Value
seq_dist returns a numeric vector with pairwise distances between a
and b of length max(length(a), length(b)).
For seq_distmatrix there are two options. If b is missing, the
dist object corresponding to the length(a) X
length(a) distance matrix is returned. If b is specified, the
length(a) X length(b) distance matrix is returned.
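The return shapes described above can be checked with a short sketch. It assumes the stringdist package is installed (hence the requireNamespace guard), and the variable names d_vec, d_self and d_ab are illustrative only.

```r
# Sketch: inspect the shape of what seq_dist and seq_distmatrix return.
# Assumes the 'stringdist' package is available; skipped otherwise.
if (requireNamespace("stringdist", quietly = TRUE)) {
  a <- lapply(c("foo", "bar", "baz"), utf8ToInt)  # three integer sequences
  b <- lapply(c("foo", "boo"), utf8ToInt)         # two integer sequences

  # Pairwise distances: the single element b[1] is recycled over a.
  d_vec <- stringdist::seq_dist(a, b[1])
  print(length(d_vec))

  # b missing: an object of class 'dist' for the length(a) X length(a) case.
  d_self <- stringdist::seq_distmatrix(a)
  print(class(d_self))

  # b given: a length(a) X length(b) matrix.
  d_ab <- stringdist::seq_distmatrix(a, b)
  print(dim(d_ab))
}
```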
If any element of a or b is NA_integer_, the distance to
any matched integer vector is NA. Missing values inside a sequence
are treated as ordinary integers, not specially (see also the examples).
Notes
Input vectors are converted with as.integer. This causes truncation for numeric
vectors (e.g. pi will be treated as 3L).
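A minimal sketch of what this truncation means in practice. The first part uses base R only. The final call assumes the stringdist package is installed, and the vectors s1 and s2 are made up for illustration.

```r
# as.integer() truncates toward zero; it does not round.
x <- c(pi, 2.9, -1.7)
x_int <- as.integer(x)
print(x_int)  # 3 2 -1

# Two numeric sequences that differ only in their fractional parts
# become identical once converted.
s1 <- c(1.2, 2.7, 3.9)
s2 <- c(1.9, 2.1, 3.0)
same_after_conversion <- identical(as.integer(s1), as.integer(s2))
print(same_after_conversion)  # TRUE

# With stringdist available, the two sequences are therefore compared
# as the same integer vector (1, 2, 3).
if (requireNamespace("stringdist", quietly = TRUE)) {
  print(stringdist::seq_dist(list(s1), list(s2)))
}
```

If fractional parts matter, it may be safer to round or rescale the input explicitly before passing it to seq_dist.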
See Also
seq_sim, seq_amatch, seq_qgrams
Examples
# Distances between lists of integer vectors. Note the postfix 'L' to force
# integer storage. The shorter argument (a) is recycled.
a <- list(c(102L, 107L)) # fk
b <- list(c(102L,111L,111L),c(102L,111L,111L)) # foo, foo
seq_dist(a,b)
# translate strings to a list of integer sequences
a <- lapply(c("foo","bar","baz"),utf8ToInt)
seq_distmatrix(a)
# Note how missing values are treated. NA's as part of the sequence are treated
# as an integer (the representation of NA_integer_).
a <- list(NA_integer_,c(102L, 107L))
b <- list(c(102L,111L,111L),c(102L,111L,NA_integer_))
seq_dist(a,b)
## Not run:
# Distance between sentences based on word order. Note: words must match exactly or they
# are treated as completely different.
#
# For this example you need to have the 'hashr' package installed.
a <- "Mary had a little lamb"
a.words <- strsplit(a,"[[:blank:]]+")
a.int <- hashr::hash(a.words)
b <- c("a little lamb had Mary",
"had Mary a little lamb")
b.int <- hashr::hash(strsplit(b,"[[:blank:]]+"))
seq_dist(a.int,b.int)
## End(Not run)