FuzzyTokenSet {comparator} | R Documentation |
Fuzzy Token Set Comparator
Description
Compares a pair of token sets x and y by computing the optimal cost of transforming x into y using single-token operations (insertions, deletions and substitutions). The cost of each single-token operation is determined at the character level using an internal string comparator.
Usage
FuzzyTokenSet(
  inner_comparator = Levenshtein(normalize = TRUE),
  agg_function = base::mean,
  deletion = 1,
  insertion = 1,
  substitution = 1
)
Arguments
inner_comparator: inner string distance comparator. Defaults to Levenshtein(normalize = TRUE).
agg_function: function used to aggregate the costs of the optimal operations. Defaults to base::mean.
deletion: non-negative weight associated with deletion of a token. Defaults to 1.
insertion: non-negative weight associated with insertion of a token. Defaults to 1.
substitution: non-negative weight associated with substitution of a token. Defaults to 1.
Details
A token set is an unordered enumeration of tokens, which may include duplicates. Given two token sets x and y, this comparator computes the optimal cost of transforming x into y using the following single-token operations:

- deleting a token a from x at cost w_d \times \mathrm{inner}(a, "")
- inserting a token b in y at cost w_i \times \mathrm{inner}("", b)
- substituting a token a in x for a token b in y at cost w_s \times \mathrm{inner}(a, b)

where \mathrm{inner} is an internal string comparator and w_d, w_i, w_s are non-negative weights, referred to as deletion, insertion and substitution in the parameter list. By default, the mean cost of the optimal set of operations is returned. Other methods of aggregating the costs are supported by specifying a non-default agg_function.
If the internal string comparator is a distance function, then the optimal set of operations minimizes the cost. Otherwise, the optimal set of operations maximizes the cost. The optimization problem is solved exactly using a linear sum assignment solver.
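The optimization above can be illustrated with a small, self-contained sketch. This is not the package's implementation: here the shorter token set is padded with empty strings (so unmatched tokens become deletions or insertions), the inner comparator is a normalized Levenshtein distance built on base R's adist, and the assignment problem is solved by brute-force enumeration rather than a linear sum assignment solver, which is only feasible for tiny token sets.

```r
# Illustrative sketch only (hypothetical helper, not the package's code):
# brute-force version of the FuzzyTokenSet cost optimization.
fuzzy_token_set <- function(x, y, deletion = 1, insertion = 1,
                            substitution = 1, agg = mean) {
  # Normalized Levenshtein distance as the inner string comparator
  inner <- function(a, b) {
    if (nchar(a) == 0 && nchar(b) == 0) return(0)
    drop(adist(a, b)) / max(nchar(a), nchar(b))
  }
  # Pad the shorter token set with empty strings: pairing a token with ""
  # corresponds to deleting it from x or inserting it in y
  n <- max(length(x), length(y))
  x <- c(x, rep("", n - length(x)))
  y <- c(y, rep("", n - length(y)))
  cost <- function(a, b) {
    if (b == "") deletion * inner(a, "")         # delete a from x
    else if (a == "") insertion * inner("", b)   # insert b in y
    else substitution * inner(a, b)              # substitute a for b
  }
  # Pairwise cost matrix between tokens of x (rows) and y (columns)
  C <- outer(seq_len(n), seq_len(n),
             Vectorize(function(i, j) cost(x[i], y[j])))
  # Enumerate all assignments of x-tokens to y-tokens
  perms <- function(v) {
    if (length(v) <= 1) return(list(v))
    do.call(c, lapply(seq_along(v), function(i)
      lapply(perms(v[-i]), function(p) c(v[i], p))))
  }
  # Aggregate (by default, mean) the operation costs of each candidate
  # assignment and keep the optimum
  min(vapply(perms(seq_len(n)),
             function(p) agg(C[cbind(seq_len(n), p)]), numeric(1)))
}

x <- c("JOSE", "ELIAS", "TEJADA", "BASQUES")
y <- c("JOSE", "BASQUES")
fuzzy_token_set(x, y)                  # 0.5: two exact matches, two deletions
fuzzy_token_set(x, y, deletion = 0.5)  # 0.25: dropped names are discounted
```

Lowering the deletion weight only discounts tokens removed from x, which is why the comparison becomes asymmetric, as in the full-name example below.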
Note
This comparator is qualitatively similar to the MongeElkan comparator; however, it is arguably more principled, since it is formulated as a cost optimization problem. It also offers more control over the costs of missing tokens (by varying the deletion and insertion weights). This is useful for comparing full names, when dropping a name (e.g. a middle name) shouldn't be severely penalized.
Examples
## Compare names with heterogeneous representations
x <- "The University of California - San Diego"
y <- "Univ. Calif. San Diego"
# Tokenize strings on white space
x <- strsplit(x, '\\s+')
y <- strsplit(y, '\\s+')
FuzzyTokenSet()(x, y)
# Reduce the cost associated with missing words
FuzzyTokenSet(deletion = 0.5, insertion = 0.5)(x, y)
## Compare full name with abbreviated name, reducing the penalty
## for dropping parts of the name
fullname <- "JOSE ELIAS TEJADA BASQUES"
name <- "JOSE BASQUES"
# Tokenize strings on white space
fullname <- strsplit(fullname, '\\s+')
name <- strsplit(name, '\\s+')
comparator <- FuzzyTokenSet(deletion = 0.5)
comparator(fullname, name) < comparator(name, fullname) # TRUE