MongeElkan {comparator}  R Documentation 
Compares a pair of token sets x
and y
by computing similarity
scores between all pairs of tokens using an internal string comparator,
then taking the mean of the maximum scores for each token in x
.
MongeElkan(
inner_comparator = Levenshtein(similarity = TRUE, normalize = TRUE),
agg_function = base::mean,
symmetrize = FALSE
)
inner_comparator 
internal string comparator of class

agg_function 
aggregation function to use when aggregating internal
similarities/distances between tokens. Defaults to 
symmetrize 
logical indicating whether to use a symmetrized version of the MongeElkan comparator. Defaults to FALSE. 
A token set is an unordered enumeration of tokens, which may include
duplicates.
Given two token sets x
and y
, the MongeElkan comparator is
defined as:
\mathrm{ME}(x, y) = \frac{1}{x} \sum_{i = 1}^{x} \max_j \mathrm{sim}(x_i, y_j)
where x_i
is the ith token in x
, x
is the
number of tokens in x
and \mathrm{sim}
is an internal
string similarity comparator.
A generalization of the original MongeElkan comparator is implemented here, which allows for distance comparators in place of similarity comparators, and/or more general aggregation functions in place of the arithmetic mean. The generalized MongeElkan comparator is defined as:
\mathrm{ME}(x, y) = \mathrm{agg}(\mathrm{opt}_j \ \mathrm{inner}(x_i, y_j))
where \mathrm{inner}
is an internal distance or similarity
comparator, \mathrm{opt}
is \max
if
\mathrm{inner}
is a similarity comparator or \min
if
it is a distance comparator, and \mathrm{agg}
is an aggregation
function which takes a vector of scores for each token in x
and
returns a scalar.
By default, the MongeElkan comparator is asymmetric in its arguments x
and y
. If symmetrize = TRUE
, a symmetric version of the comparator
is obtained as follows
\mathrm{ME}_{sym}(x, y) = \mathrm{opt} \ \{\mathrm{ME}(x, y), \mathrm{ME}(y, x)\}
where \mathrm{opt}
is defined above.
A MongeElkan
instance is returned, which is an S4 class inheriting from
StringComparator
.
Monge, A. E., & Elkan, C. (1996), "The Field Matching Problem: Algorithms and Applications", In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 267270.
Jimenez, S., Becerra, C., Gelbukh, A., & Gonzalez, F. (2009), "Generalized MongeElkan Method for Approximate Text String Comparison", In Computational Linguistics and Intelligent Text Processing, pp. 559570.
## Compare names with heterogenous representations
x < "The University of California  San Diego"
y < "Univ. Calif. San Diego"
# Tokenize strings on white space
x < strsplit(x, '\\s+')
y < strsplit(y, '\\s+')
MongeElkan()(x, y)
## The symmetrized variant is arguably more appropriate for this example
MongeElkan(symmetrize = TRUE)(x, y)
## Using a different internal comparator changes the result
MongeElkan(inner_comparator = BinaryComp(), symmetrize=TRUE)(x, y)