stringsim {stringdist} | R Documentation |
Compute similarity scores between strings
Description
stringsim computes pairwise string similarities between elements of
character vectors a and b, where the vector with fewer elements is
recycled. stringsimmatrix computes the string similarity matrix with rows
according to a and columns according to b.
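As a brief illustration of the difference between the two functions, the following sketch compares the shape of their results. The input strings are made up for this example, and the code assumes the stringdist package is installed.

```r
library(stringdist)

# Hypothetical inputs chosen only for illustration
a <- c("apple", "banana", "cherry")
b <- "apple"

# Element-wise comparison: the shorter vector (b) is recycled along a,
# so the result has one score per element of a
sim_vec <- stringsim(a, b)
length(sim_vec)

# All-pairs comparison: rows follow a, columns follow the second argument
sim_mat <- stringsimmatrix(a, c("apple", "berry"))
dim(sim_mat)
```

The first element of sim_vec compares "apple" with itself and should therefore be 1.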
Usage
stringsim(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
useBytes = FALSE,
q = 1,
...
)
stringsimmatrix(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
useBytes = FALSE,
q = 1,
...
)
Arguments
a |
R object (target); will be converted by |
b |
R object (source); will be converted by |
method |
Method for distance calculation. The default is |
useBytes |
Perform byte-wise comparison, see |
q |
Size of the |
... |
additional arguments are passed on to |
Details
The similarity is calculated by first calculating the distance using
stringdist, dividing the distance by the maximum possible distance, and
subtracting the result from 1. This results in a score between 0 and 1,
with 1 corresponding to complete similarity and 0 to complete dissimilarity.
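The calculation described above can be sketched by hand for a single pair of strings. This example uses base R's utils::adist to obtain a Levenshtein distance and assumes that the maximum possible distance for that metric is the length of the longer string. It is a rough illustration of the normalization, not the package's internal code.

```r
# Hypothetical inputs chosen only for illustration
a <- "kitten"
b <- "sitting"

# Step 1: compute the distance (Levenshtein, via base R)
d <- as.numeric(utils::adist(a, b))

# Step 2: maximum possible distance, assumed here to be the longer length
max_d <- max(nchar(a), nchar(b))

# Step 3: normalize and subtract from 1
sim <- 1 - d / max_d
sim
```

Under these assumptions the result should agree with stringsim(a, b, method = "lv").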
Note that complete similarity only means equality for distances satisfying
the identity property. This is not the case e.g. for q-gram based distances
(for example if q=1, anagrams are completely similar).
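The anagram case can be checked directly. The sketch below uses two made-up strings that contain the same characters in a different order and assumes the stringdist package is installed.

```r
library(stringdist)

# With q = 1 only the character counts matter, so anagrams have
# q-gram distance 0 and hence similarity 1
sim_qgram <- stringsim("listen", "silent", method = "qgram", q = 1)

# An edit-based method still treats the two strings as different
sim_lv <- stringsim("listen", "silent", method = "lv")

c(qgram = sim_qgram, lv = sim_lv)
```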
For distances where weights can be specified, the maximum distance
is currently computed by assuming that all weights are equal to 1.
Value
stringsim returns a vector with similarities, which are values between
0 and 1 where 1 corresponds to perfect similarity (distance 0) and 0 to
complete dissimilarity. NA is returned when stringdist returns NA.
Distances equal to Inf are truncated to a similarity of 0. stringsimmatrix
works the same way but, analogous to stringdistmatrix, returns a similarity
matrix instead of a vector.
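The two special cases mentioned above can be seen with a small sketch. It assumes the stringdist package is installed and relies on the Hamming distance being infinite for strings of unequal length.

```r
library(stringdist)

# Missing input: the underlying distance is NA, so the similarity is NA
na_sim <- stringsim(NA_character_, "abc")

# Unequal lengths under Hamming give an infinite distance,
# which is reported as a similarity of 0
inf_sim <- stringsim("ab", "abc", method = "hamming")

na_sim
inf_sim
```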
Examples
# Calculate the similarity using the default method of optimal string alignment
stringsim("ca", "abc")
# Calculate the similarity using the Jaro-Winkler method
# The p argument is passed on to stringdist
stringsim('MARTHA','MATHRA',method='jw', p=0.1)