MongeElkan {comparator}R Documentation

Monge-Elkan Token Comparator

Description

Compares a pair of token sets xx and yy by computing similarity scores between all pairs of tokens using an internal string comparator, then taking the mean of the maximum scores for each token in xx.

Usage

MongeElkan(
  inner_comparator = Levenshtein(similarity = TRUE, normalize = TRUE),
  agg_function = base::mean,
  symmetrize = FALSE
)

Arguments

inner_comparator

internal string comparator of class StringComparator. Defaults to Levenshtein similarity.

agg_function

aggregation function to use when aggregating internal similarities/distances between tokens. Defaults to mean, however hmean may be a better choice when the comparator returns normalized similarity scores.

symmetrize

logical indicating whether to use a symmetrized version of the Monge-Elkan comparator. Defaults to FALSE.

Details

A token set is an unordered enumeration of tokens, which may include duplicates. Given two token sets xx and yy, the Monge-Elkan comparator is defined as:

ME(x,y)=1xi=1xmaxjsim(xi,yj)\mathrm{ME}(x, y) = \frac{1}{|x|} \sum_{i = 1}^{|x|} \max_j \mathrm{sim}(x_i, y_j)

where xix_i is the i-th token in xx, x|x| is the number of tokens in xx and sim\mathrm{sim} is an internal string similarity comparator.

A generalization of the original Monge-Elkan comparator is implemented here, which allows for distance comparators in place of similarity comparators, and/or more general aggregation functions in place of the arithmetic mean. The generalized Monge-Elkan comparator is defined as:

ME(x,y)=agg(optj inner(xi,yj))\mathrm{ME}(x, y) = \mathrm{agg}(\mathrm{opt}_j \ \mathrm{inner}(x_i, y_j))

where inner\mathrm{inner} is an internal distance or similarity comparator, opt\mathrm{opt} is max\max if inner\mathrm{inner} is a similarity comparator or min\min if it is a distance comparator, and agg\mathrm{agg} is an aggregation function which takes a vector of scores for each token in xx and returns a scalar.

By default, the Monge-Elkan comparator is asymmetric in its arguments xx and yy. If symmetrize = TRUE, a symmetric version of the comparator is obtained as follows

MEsym(x,y)=opt {ME(x,y),ME(y,x)}\mathrm{ME}_{sym}(x, y) = \mathrm{opt} \ \{\mathrm{ME}(x, y), \mathrm{ME}(y, x)\}

where opt\mathrm{opt} is defined above.

Value

A MongeElkan instance is returned, which is an S4 class inheriting from StringComparator.

References

Monge, A. E., & Elkan, C. (1996), "The Field Matching Problem: Algorithms and Applications", In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 267-270.

Jimenez, S., Becerra, C., Gelbukh, A., & Gonzalez, F. (2009), "Generalized Monge-Elkan Method for Approximate Text String Comparison", In Computational Linguistics and Intelligent Text Processing, pp. 559-570.

Examples

## Compare names with heterogenous representations
x <- "The University of California - San Diego"
y <- "Univ. Calif. San Diego"
# Tokenize strings on white space
x <- strsplit(x, '\\s+')
y <- strsplit(y, '\\s+')
MongeElkan()(x, y)

## The symmetrized variant is arguably more appropriate for this example
MongeElkan(symmetrize = TRUE)(x, y) 

## Using a different internal comparator changes the result
MongeElkan(inner_comparator = BinaryComp(), symmetrize=TRUE)(x, y)


[Package comparator version 0.1.2 Index]