strcmp {RecordLinkage} | R Documentation
String Metrics
Description
Functions for computation of the similarity between two strings.
Usage
jarowinkler(str1, str2, W_1=1/3, W_2=1/3, W_3=1/3, r=0.5)
levenshteinSim(str1, str2)
levenshteinDist(str1, str2)
Arguments
str1, str2 | Two character vectors to compare.
W_1, W_2, W_3 | Adjustable weights.
r | Maximum transposition radius. A fraction of the length of the shorter string.
Details
String metrics compute a similarity value in the range [0,1] for two strings, with 1 denoting the highest (usually equality) and 0 denoting the lowest degree of similarity. In the context of Record Linkage, string similarities can improve the discernibility between matches and non-matches.
jarowinkler is an implementation of the algorithm by Jaro and Winkler (see references). For the meaning of W_1, W_2, W_3 and r, see the referenced article. For most applications, the default values are reasonable.
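As a small illustration of the defaults, the sketch below calls jarowinkler once with no tuning arguments and once with the default values from the Usage section written out by name. It assumes the RecordLinkage package is installed; the requireNamespace guard and the variable names default_call and explicit_call are only part of this sketch.

```r
# Assumes RecordLinkage is installed; the guard keeps the snippet from
# failing where it is not.
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  library(RecordLinkage)

  # Call relying on the defaults listed under Usage.
  default_call <- jarowinkler("Andreas", "Anreas")

  # Same call with the documented defaults spelled out by name.
  explicit_call <- jarowinkler("Andreas", "Anreas",
                               W_1 = 1/3, W_2 = 1/3, W_3 = 1/3, r = 0.5)

  # Both calls should agree, since the second only repeats the defaults.
  print(isTRUE(all.equal(default_call, explicit_call)))
}
```

Naming the arguments explicitly makes it easier to see which value changes when a weight or the radius is later tuned.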
levenshteinDist returns the Levenshtein distance. This is a non-negative edit count in which 0 means equality, so it is not on the [0,1] similarity scale and cannot be used directly as a string comparator.
levenshteinSim is a similarity function based on the Levenshtein distance, calculated by
1-\frac{\mathrm{d}(\mathit{str}_{1},\mathit{str}_{2})}{\max(A,B)},
where \mathrm{d} is the Levenshtein distance function and A and B are the lengths of the strings.
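For example, "Andreas" and "Anreas" differ by one deletion, so the distance is 1, the longer string has 7 characters, and the similarity is 1 - 1/7, roughly 0.857. The sketch below reproduces this formula with base R only. It is not the package's implementation: lev_sim_sketch is a hypothetical helper, and utils::adist is used here as a stand-in for the Levenshtein distance function.

```r
# Hypothetical helper, not part of RecordLinkage: applies
# 1 - d(str1, str2) / max(A, B) element-wise using base R.
lev_sim_sketch <- function(str1, str2) {
  # utils::adist() returns a distance matrix; for one pair it is 1 x 1,
  # so drop() reduces it to a single number. mapply() recycles the
  # shorter argument.
  d <- mapply(function(a, b) drop(utils::adist(a, b)),
              str1, str2, USE.NAMES = FALSE)
  # pmax() picks the longer of the two string lengths for each pair.
  1 - d / pmax(nchar(str1), nchar(str2))
}

lev_sim_sketch("Andreas", "Anreas")              # 1 - 1/7
lev_sim_sketch("Andreas", c("Anreas", "Andeas")) # one value per pair
```

Dividing by the longer length keeps the result in [0,1], because the Levenshtein distance can never exceed the length of the longer string. The sketch does not handle the case of two empty strings, where the denominator is 0.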
Arguments str1 and str2 are expected to be of type "character".
Non-alphabetical characters can be processed. Valid format combinations for the arguments are:
Two arrays with the same dimensions.
Two vectors. The shorter one is recycled as necessary.
Value
A numeric vector with similarity values in the interval [0,1]. For levenshteinDist, the edit distance as an integer vector.
Note
String comparison is case-sensitive, which means that, for example, "R" and "r" have a similarity of 0. If this behaviour is undesired, strings should be normalized before processing.
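One possible normalization is to lower-case both inputs with base R's tolower before comparing them. The sketch below is only an illustration: the variable names are made up, and the call to levenshteinSim assumes the RecordLinkage package is installed.

```r
# Illustrative inputs that differ only in letter case.
raw_a <- c("R", "Andreas")
raw_b <- c("r", "ANDREAS")

# Normalize case before comparison.
norm_a <- tolower(raw_a)
norm_b <- tolower(raw_b)

# After normalization the two vectors are identical.
identical(norm_a, norm_b)

# Assumes RecordLinkage is installed; the guard keeps the snippet from
# failing where it is not.
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  print(RecordLinkage::levenshteinSim(raw_a, raw_b))   # case-sensitive comparison
  print(RecordLinkage::levenshteinSim(norm_a, norm_b)) # comparison after normalization
}
```

Other normalization steps, such as trimming whitespace or removing diacritics, depend on the data and are left out here.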
Author(s)
Andreas Borg, Murat Sariyar
References
Winkler, W.E.: String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association (1990), pp. 354–369.
Examples
# compare two strings:
jarowinkler("Andreas","Anreas")
# compare one string with several others:
levenshteinSim("Andreas",c("Anreas","Andeas"))
# compare two vectors of strings:
jarowinkler(c("Andreas","Borg"),c("Andreas","Bork"))