afind {stringdist} | R Documentation |
Stringdist-based fuzzy text search
Description
afind slides a window of fixed width over a string x and computes the distance between each window and the sought-after pattern. The location, content, and distance corresponding to the window with the best match are returned.
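The sliding-window idea can be sketched in a few lines of base R. This is an illustration only, not the package's implementation: the helper name sliding_find is hypothetical, Levenshtein distance from utils::adist stands in for the configurable method, and the window width is assumed to default to the pattern length.

```r
# Illustrative sketch of the sliding-window search (not the stringdist code).
# Assumptions: Levenshtein distance via utils::adist replaces 'method', and
# the window width defaults to nchar(pattern).
sliding_find <- function(x, pattern, window = nchar(pattern)) {
  n <- nchar(x)
  # A string no longer than the window is compared as a whole.
  if (n <= window) {
    return(list(location = 1L,
                distance = drop(utils::adist(x, pattern)),
                match    = x))
  }
  starts  <- seq_len(n - window + 1L)                 # start of every window
  windows <- substring(x, starts, starts + window - 1L)
  d       <- drop(utils::adist(windows, pattern))     # one distance per window
  best    <- which.min(d)                             # first window with the smallest distance
  list(location = starts[best], distance = d[best], match = windows[best])
}

res <- sliding_find("I want to be a fisherman", "fish")
res$location   # start of the best matching window
res$match      # content of that window
```

The returned list mirrors the three pieces of information named above (location, content, distance) for a single string and a single pattern.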
Usage
afind(
x,
pattern,
window = NULL,
value = TRUE,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine",
"jaccard", "jw", "soundex"),
useBytes = FALSE,
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
nthread = getOption("sd_num_thread")
)
grab(x, pattern, maxDist = Inf, value = FALSE, ...)
grabl(x, pattern, maxDist = Inf, ...)
extract(x, pattern, maxDist = Inf, ...)
Arguments
x |
strings to search in |
pattern |
strings to find (not a regular expression). For |
window |
width of moving window. |
value |
toggle return matrix with matched strings. |
method |
Matching algorithm to use. See |
useBytes |
Perform byte-wise comparison. See |
weight |
For |
q |
q-gram size, only when method is |
p |
Winkler's 'prefix' parameter for Jaro-Winkler distance, with
|
bt |
Winkler's boost threshold. Winkler's prefix factor is
only applied when the Jaro distance is larger than |
nthread |
Number of threads used by the underlying C-code. A sensible
default is chosen, see |
maxDist |
Only windows with distance |
... |
passed to |
Details
Matching is case-sensitive. Both x and pattern are converted to UTF-8 prior to search, unless useBytes=TRUE, in which case the distances are measured bytewise.
Code is parallelized over the x variable: each value of x is scanned for every element in pattern using a separate thread (when nthread is larger than 1).
The functions grab and grabl are approximate string matching functions that somewhat resemble base R's grep and grepl. They are implemented as convenience wrappers of afind.
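How a grep-like and a grepl-like function can be layered on top of a best-window distance is sketched below in base R. The names best_window_distance, fuzzy_grepl, and fuzzy_grep are hypothetical, and utils::adist (Levenshtein) is used as a stand-in distance; the actual wrappers delegate to afind and may differ in detail.

```r
# Hypothetical helper: smallest Levenshtein distance between 'pattern' and
# any window of width nchar(pattern) in a single string 's'.
best_window_distance <- function(s, pattern) {
  w      <- nchar(pattern)
  starts <- seq_len(max(nchar(s) - w + 1L, 1L))
  as.numeric(min(utils::adist(substring(s, starts, starts + w - 1L), pattern)))
}

# grepl-like: TRUE where the best window lies within maxDist of the pattern.
fuzzy_grepl <- function(x, pattern, maxDist = Inf) {
  d <- vapply(x, best_window_distance, numeric(1),
              pattern = pattern, USE.NAMES = FALSE)
  d <= maxDist
}

# grep-like: indices of the elements of x that contain a close enough window.
fuzzy_grep <- function(x, pattern, maxDist = Inf) {
  which(fuzzy_grepl(x, pattern, maxDist))
}

texts <- c("When I grow up, I want to be",
           "one of the harvesters of the sea",
           "I think before my days are gone",
           "I want to be a fisherman")
fuzzy_grepl(texts, "grew", maxDist = 1)
fuzzy_grep(texts, "grew", maxDist = 1)
```

Only the first text contains a window ("grow") within one edit of "grew", so the logical variant flags that element and the index variant returns its position.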
Value
For afind: a list of three matrices, each with length(x) rows and length(pattern) columns. In each matrix, element (i,j) corresponds to x[i] and pattern[j]. The names and descriptions of the matrices are as follows.
location. [integer], location of the start of the best matching window. When useBytes=FALSE, this corresponds to the location of a UTF code point in x, possibly after conversion from its original encoding.
distance. [numeric], the string distance between pattern and the best matching window.
match. [character], the first, best matching window.
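The difference between a code point location and a byte location can be made concrete with base R alone. The example string below is made up for illustration and the lines do not call stringdist; they only count characters and bytes in front of the sought-after text.

```r
# Hypothetical example string. "\u00e9" is a single code point that
# occupies two bytes in UTF-8.
x      <- "caf\u00e9 fish"
prefix <- "caf\u00e9 "            # everything in front of "fish"

loc_chars <- nchar(prefix, type = "chars") + 1L   # position counted in code points
loc_bytes <- nchar(prefix, type = "bytes") + 1L   # position counted in bytes

substr(x, loc_chars, loc_chars + 3L)              # substr() indexes by character
```

Here the two counts differ by one because of the single two-byte character; with more multi-byte characters in front of the match the gap would grow accordingly.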
For grab, an integer vector, indicating in which elements of x a match was found with a distance <= maxDist. When value=TRUE, the matched values are returned instead (equivalent to grep).
For grabl, a logical vector, indicating in which elements of x a match was found with a distance <= maxDist (equivalent to grepl).
For extract, a character matrix with length(x) rows and length(pattern) columns. If a match was found, element (i,j) contains the match; otherwise it is set to NA.
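The matrix layout and the NA convention described for extract can be illustrated with a small base R sketch. The function names best_window and fuzzy_extract are hypothetical, utils::adist (Levenshtein) replaces the package's distance methods, and the window width is assumed to equal the pattern length.

```r
# Hypothetical helper: first best-matching window of width nchar(pattern)
# in a single string, together with its Levenshtein distance.
best_window <- function(s, pattern) {
  w      <- nchar(pattern)
  starts <- seq_len(max(nchar(s) - w + 1L, 1L))
  wins   <- substring(s, starts, starts + w - 1L)
  d      <- drop(utils::adist(wins, pattern))
  best   <- which.min(d)
  list(match = wins[best], distance = d[best])
}

# One row per element of x, one column per pattern; NA where the best
# window is further than maxDist from the pattern.
fuzzy_extract <- function(x, pattern, maxDist = Inf) {
  out <- matrix(NA_character_, nrow = length(x), ncol = length(pattern))
  for (i in seq_along(x)) {
    for (j in seq_along(pattern)) {
      bw <- best_window(x[i], pattern[j])
      if (bw$distance <= maxDist) out[i, j] <- bw$match
    }
  }
  out
}

texts <- c("When I grow up, I want to be",
           "one of the harvesters of the sea",
           "I think before my days are gone",
           "I want to be a fisherman")
res <- fuzzy_extract(texts, "harvested", maxDist = 3)
res
```

In this sketch only the second text has a window ("harvester") within three edits of "harvested"; the other rows stay NA.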
Running cosine distance
This algorithm gains efficiency by exploiting the fact that two consecutive windows have a large overlap in their q-gram profiles. It gives the same result as the "cosine" distance, but is much faster.
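One way to exploit the overlap between consecutive windows is to update the dot product and the squared norm of the window's q-gram profile as a single q-gram leaves and a single q-gram enters. The base R sketch below illustrates that idea under stated assumptions; all function names are hypothetical and it is not claimed to reproduce the package's C implementation.

```r
# All q-grams of a string, in order of appearance.
qgrams <- function(s, q) {
  starts <- seq_len(nchar(s) - q + 1L)
  substring(s, starts, starts + q - 1L)
}

# q-gram profile as a named count vector.
qgram_profile <- function(s, q) {
  tab <- table(qgrams(s, q))
  setNames(as.numeric(tab), names(tab))
}

cosine_dist <- function(a, b) {
  common <- intersect(names(a), names(b))
  1 - sum(a[common] * b[common]) / sqrt(sum(a^2) * sum(b^2))
}

# Naive version: rebuild the profile of every window from scratch.
naive_cosine <- function(x, pattern, window, q) {
  p      <- qgram_profile(pattern, q)
  starts <- seq_len(nchar(x) - window + 1L)
  vapply(starts, function(i) {
    cosine_dist(qgram_profile(substr(x, i, i + window - 1L), q), p)
  }, numeric(1))
}

# Running version: keep counts, dot product, and squared norm up to date.
running_cosine_sketch <- function(x, pattern, window, q) {
  p  <- qgram_profile(pattern, q)
  g  <- qgrams(x, q)
  k  <- window - q + 1L                       # q-grams per window
  u  <- unique(g)
  pw <- setNames(numeric(length(u)), u)       # pattern count for each q-gram of x
  shared <- intersect(u, names(p))
  pw[shared] <- p[shared]
  counts <- pw * 0                            # q-gram counts of the current window
  for (gram in g[seq_len(k)]) counts[[gram]] <- counts[[gram]] + 1
  dot   <- sum(counts * pw)
  norm2 <- sum(counts^2)
  pn2   <- sum(p^2)
  out   <- numeric(length(g) - k + 1L)
  out[1] <- 1 - dot / sqrt(norm2 * pn2)
  for (i in seq_along(out)[-1]) {
    old <- g[i - 1L]                          # q-gram leaving the window
    new <- g[i + k - 1L]                      # q-gram entering the window
    norm2 <- norm2 - (2 * counts[[old]] - 1); counts[[old]] <- counts[[old]] - 1
    norm2 <- norm2 + (2 * counts[[new]] + 1); counts[[new]] <- counts[[new]] + 1
    dot   <- dot - pw[[old]] + pw[[new]]
    out[i] <- 1 - dot / sqrt(norm2 * pn2)
  }
  out
}

x <- "I want to be a fisherman"
n <- naive_cosine(x, "to be", window = 5L, q = 2L)
r <- running_cosine_sketch(x, "to be", window = 5L, q = 2L)
all.equal(n, r)
```

Each slide touches two q-grams instead of rebuilding a full profile, which is where the saving comes from; both versions yield the same distances.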
See Also
Other matching:
amatch()
Examples
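# four lines of text to search in
texts = c("When I grow up, I want to be"
, "one of the harvesters of the sea"
, "I think before my days are gone"
, "I want to be a fisherman")
patterns = c("fish", "gone", "to be")
# location, distance, and best matching window for every text/pattern pair
afind(texts, patterns, method="running_cosine", q=3)
# which texts contain a window within distance 1 of "grew"?
grabl(texts, "grew", maxDist=1)
# best matching window per text; NA where the distance exceeds 3
extract(texts, "harvested", maxDist=3)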