FuzzyTokenSet {comparator} | R Documentation |
Fuzzy Token Set Comparator
Description
Compares a pair of token sets x and y by computing the optimal cost of transforming x into y using single-token operations (insertions, deletions and substitutions). The cost of each single-token operation is determined at the character level using an internal string comparator.
Usage
FuzzyTokenSet(
  inner_comparator = Levenshtein(normalize = TRUE),
  agg_function = base::mean,
  deletion = 1,
  insertion = 1,
  substitution = 1
)
Arguments
inner_comparator: inner string distance comparator. Defaults to Levenshtein(normalize = TRUE).
agg_function: function used to aggregate the costs of the optimal operations. Defaults to base::mean.
deletion: non-negative weight associated with deletion of a token. Defaults to 1.
insertion: non-negative weight associated with insertion of a token. Defaults to 1.
substitution: non-negative weight associated with substitution of a token. Defaults to 1.
Details
A token set is an unordered enumeration of tokens, which may include duplicates. Given two token sets x and y, this comparator computes the optimal cost of transforming x into y using the following single-token operations:

- deleting a token a from x at cost w_d \times \mathrm{inner}(a, "")
- inserting a token b in y at cost w_i \times \mathrm{inner}("", b)
- substituting a token a in x for a token b in y at cost w_s \times \mathrm{inner}(a, b)

where \mathrm{inner} is an internal string comparator and w_d, w_i, w_s are non-negative weights, referred to as deletion, insertion and substitution in the parameter list. By default, the mean cost of the optimal set of operations is returned. Other methods of aggregating the costs are supported by specifying a non-default agg_function.
If the internal string comparator is a distance function, then the optimal set of operations minimizes the cost. Otherwise, the optimal set of operations maximizes the cost. The optimization problem is solved exactly using a linear sum assignment solver.
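The optimization above can be illustrated with a small, self-contained sketch. This is not the package's implementation: here the shorter token set is padded with empty strings (so unmatched tokens become deletions or insertions), the inner comparator is a normalized Levenshtein distance built on base R's adist, and the assignment problem is solved by brute-force enumeration rather than a linear sum assignment solver, which is only feasible for tiny token sets.

```r
# Illustrative sketch only (hypothetical helper, not the package's code):
# brute-force version of the FuzzyTokenSet cost optimization.
fuzzy_token_set <- function(x, y, deletion = 1, insertion = 1,
                            substitution = 1, agg = mean) {
  # Normalized Levenshtein distance as the inner string comparator
  inner <- function(a, b) {
    if (nchar(a) == 0 && nchar(b) == 0) return(0)
    drop(adist(a, b)) / max(nchar(a), nchar(b))
  }
  # Pad the shorter token set with empty strings: pairing a token with ""
  # corresponds to deleting it from x or inserting it in y
  n <- max(length(x), length(y))
  x <- c(x, rep("", n - length(x)))
  y <- c(y, rep("", n - length(y)))
  cost <- function(a, b) {
    if (b == "") deletion * inner(a, "")         # delete a from x
    else if (a == "") insertion * inner("", b)   # insert b in y
    else substitution * inner(a, b)              # substitute a for b
  }
  # Pairwise cost matrix between tokens of x (rows) and y (columns)
  C <- outer(seq_len(n), seq_len(n),
             Vectorize(function(i, j) cost(x[i], y[j])))
  # Enumerate all assignments of x-tokens to y-tokens
  perms <- function(v) {
    if (length(v) <= 1) return(list(v))
    do.call(c, lapply(seq_along(v), function(i)
      lapply(perms(v[-i]), function(p) c(v[i], p))))
  }
  # Aggregate (by default, mean) the operation costs of each candidate
  # assignment and keep the optimum
  min(vapply(perms(seq_len(n)),
             function(p) agg(C[cbind(seq_len(n), p)]), numeric(1)))
}

x <- c("JOSE", "ELIAS", "TEJADA", "BASQUES")
y <- c("JOSE", "BASQUES")
fuzzy_token_set(x, y)                  # 0.5: two exact matches, two deletions
fuzzy_token_set(x, y, deletion = 0.5)  # 0.25: dropped names are discounted
```

Lowering the deletion weight only discounts tokens removed from x, which is why the comparison becomes asymmetric, as in the full-name example below.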
Note
This comparator is qualitatively similar to the MongeElkan comparator; however, it is arguably more principled, since it is formulated as a cost optimization problem. It also offers more control over the costs of missing tokens (by varying the deletion and insertion weights). This is useful for comparing full names, when dropping a name (e.g. a middle name) shouldn't be severely penalized.
Examples
## Compare names with heterogeneous representations
x <- "The University of California - San Diego"
y <- "Univ. Calif. San Diego"
# Tokenize strings on white space
x <- strsplit(x, '\\s+')
y <- strsplit(y, '\\s+')
FuzzyTokenSet()(x, y)
# Reduce the cost associated with missing words
FuzzyTokenSet(deletion = 0.5, insertion = 0.5)(x, y)
## Compare full name with abbreviated name, reducing the penalty
## for dropping parts of the name
fullname <- "JOSE ELIAS TEJADA BASQUES"
name <- "JOSE BASQUES"
# Tokenize strings on white space
fullname <- strsplit(fullname, '\\s+')
name <- strsplit(name, '\\s+')
comparator <- FuzzyTokenSet(deletion = 0.5)
comparator(fullname, name) < comparator(name, fullname) # TRUE