MongeElkan {comparator} | R Documentation |
Monge-Elkan Token Comparator
Description
Compares a pair of token sets $x$ and $y$ by computing similarity
scores between all pairs of tokens using an internal string comparator,
then taking the mean of the maximum scores for each token in $x$.
Usage
MongeElkan(
inner_comparator = Levenshtein(similarity = TRUE, normalize = TRUE),
agg_function = base::mean,
symmetrize = FALSE
)
Arguments
inner_comparator |
internal string comparator of class StringComparator. Defaults to Levenshtein(similarity = TRUE, normalize = TRUE). |
agg_function |
aggregation function to use when aggregating internal
similarities/distances between tokens. Defaults to base::mean. |
symmetrize |
logical indicating whether to use a symmetrized version of the Monge-Elkan comparator. Defaults to FALSE. |
Details
A token set is an unordered enumeration of tokens, which may include
duplicates.
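Given two token sets $x$ and $y$, the Monge-Elkan comparator is
defined as:
$$\mathrm{ME}(x, y) = \frac{1}{|x|} \sum_{i=1}^{|x|} \max_{j} \mathrm{sim}(x_i, y_j)$$
where $x_i$ is the $i$-th token in $x$, $y_j$ is the $j$-th token in $y$,
$|x|$ is the number of tokens in $x$, and $\mathrm{sim}$ is an internal
string similarity comparator.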
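The definition can be sketched in base R as follows. This is an illustration only, not the package's implementation: lev_sim and monge_elkan are hypothetical helper names, and the inner similarity is a simple stand-in (one minus the Levenshtein distance from utils::adist divided by the length of the longer token) that is not claimed to reproduce the normalization used by Levenshtein(similarity = TRUE, normalize = TRUE).

```r
# Hypothetical stand-in for an inner similarity in [0, 1]:
# 1 - (Levenshtein distance / length of the longer token).
lev_sim <- function(a, b) {
  d <- utils::adist(a, b)                     # length(a) x length(b) edit distances
  longest <- outer(nchar(a), nchar(b), pmax)  # longer token length for each pair
  ifelse(longest == 0, 1, 1 - d / longest)
}

# ME(x, y): for each token in x take its best match in y, then average.
monge_elkan <- function(x, y, sim = lev_sim) {
  scores <- sim(x, y)          # rows index tokens of x, columns tokens of y
  mean(apply(scores, 1, max))  # max over j, then mean over i
}

x <- c("The", "University", "of", "California", "-", "San", "Diego")
y <- c("Univ.", "Calif.", "San", "Diego")
monge_elkan(x, y)
monge_elkan(y, x)  # the two directions need not agree
```

Because the average runs over the tokens of the first argument only, swapping the arguments changes which tokens must find a match, which is why the two calls can return different scores.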
A generalization of the original Monge-Elkan comparator is implemented here, which allows for distance comparators in place of similarity comparators, and/or more general aggregation functions in place of the arithmetic mean. The generalized Monge-Elkan comparator is defined as:
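$$\mathrm{ME}(x, y) = \mathrm{agg}_{i} \big( \mathrm{opt}_{j} \ \mathrm{inner}(x_i, y_j) \big)$$
where $\mathrm{inner}$ is an internal distance or similarity
comparator, $\mathrm{opt}$ is $\max$ if $\mathrm{inner}$ is a similarity
comparator or $\min$ if it is a distance comparator, and $\mathrm{agg}$ is
an aggregation function which takes a vector of scores, one for each token
in $x$, and returns a scalar.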
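A minimal base R sketch of the generalized form may help make the roles of the inner comparator, the optimizer, and the aggregation function concrete. The names generalized_me, lev_dist and exact_sim are hypothetical and are not part of the package; the inner comparators are assumed to return a matrix with one row per token of the first argument and one column per token of the second.

```r
# Generalized Monge-Elkan: aggregate, over tokens of x, the best inner score
# against any token of y.
# inner:      function(a, b) returning a length(a) x length(b) score matrix
# similarity: TRUE -> best means max; FALSE (distance) -> best means min
generalized_me <- function(x, y, inner, similarity, agg = mean) {
  opt <- if (similarity) max else min
  scores <- inner(x, y)
  agg(apply(scores, 1, opt))
}

# Inner distance: raw Levenshtein distance from base R.
lev_dist <- function(a, b) utils::adist(a, b)
# Inner similarity: 1 if two tokens are identical, 0 otherwise.
exact_sim <- function(a, b) outer(a, b, "==") * 1

x <- c("San", "Diego", "CA")
y <- c("San", "Deigo")
generalized_me(x, y, lev_dist, similarity = FALSE)
generalized_me(x, y, lev_dist, similarity = FALSE, agg = stats::median)
generalized_me(x, y, exact_sim, similarity = TRUE)
```

With a distance as the inner comparator the result is itself a distance (smaller means closer), so scores from the first two calls are not comparable with the similarity returned by the third.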
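By default, the Monge-Elkan comparator is asymmetric in its arguments
$x$ and $y$. If symmetrize = TRUE, a symmetric version of the comparator
is obtained as follows:
$$\mathrm{ME}_{sym}(x, y) = \mathrm{opt} \{ \mathrm{ME}(x, y), \mathrm{ME}(y, x) \}$$
where $\mathrm{opt}$ is $\max$ for a similarity comparator and $\min$ for
a distance comparator, as defined above.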
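The effect of symmetrization can be illustrated with a short base R sketch. It assumes the symmetric score is obtained by applying the optimizer (max for similarities, min for distances) to the two asymmetric scores; me, me_sym and exact_sim are hypothetical helper names, not functions from the package.

```r
# Asymmetric score: aggregate the best inner score for each token of x.
me <- function(x, y, inner, opt, agg = mean) {
  agg(apply(inner(x, y), 1, opt))
}

# Symmetric score: apply the same optimizer to both directions.
me_sym <- function(x, y, inner, opt, agg = mean) {
  opt(me(x, y, inner, opt, agg), me(y, x, inner, opt, agg))
}

# Inner similarity: 1 if two tokens are identical, 0 otherwise.
exact_sim <- function(a, b) outer(a, b, "==") * 1

x <- c("University", "of", "California", "San", "Diego")
y <- c("San", "Diego")

me(x, y, exact_sim, max)      # only some tokens of x have a match in y
me(y, x, exact_sim, max)      # every token of y has a match in x
me_sym(x, y, exact_sim, max)  # same value whichever argument comes first
```

When one token set is roughly contained in the other, the asymmetric score depends heavily on which set is passed first, whereas the symmetric score does not.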
Value
A MongeElkan instance is returned, which is an S4 class inheriting from StringComparator.
References
Monge, A. E., & Elkan, C. (1996), "The Field Matching Problem: Algorithms and Applications", In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 267-270.
Jimenez, S., Becerra, C., Gelbukh, A., & Gonzalez, F. (2009), "Generalized Monge-Elkan Method for Approximate Text String Comparison", In Computational Linguistics and Intelligent Text Processing, pp. 559-570.
Examples
## Compare names with heterogeneous representations
x <- "The University of California - San Diego"
y <- "Univ. Calif. San Diego"
# Tokenize strings on white space
x <- strsplit(x, '\\s+')
y <- strsplit(y, '\\s+')
MongeElkan()(x, y)
## The symmetrized variant is arguably more appropriate for this example
MongeElkan(symmetrize = TRUE)(x, y)
## Using a different internal comparator changes the result
MongeElkan(inner_comparator = BinaryComp(), symmetrize = TRUE)(x, y)