lev_weighted_token_ratio {levitate} | R Documentation
Weighted token similarity measure
Description
Computes string similarity, but allows you to assign weights to specific tokens. This is useful, for example, when the strings contain a frequently occurring token that carries little information (such as a company suffix like "ltd"). See examples.
Usage
lev_weighted_token_ratio(a, b, weights = list(), ...)
Arguments
a, b
    The input strings.

weights
    List of token weights. For example, weights = list(ltd = 0.1) assigns the token "ltd" a weight of 0.1.

...
    Additional arguments passed on to the underlying function.
Value
A float
Details
The algorithm used here is as follows:

1. Tokenise the input strings.
2. Compute the edit distance between each pair of tokens.
3. Compute the maximum edit distance between each pair of tokens.
4. Apply any weights from the weights argument.
5. Return 1 - (sum(weighted_edit_distances) / sum(weighted_max_edit_distances))
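The steps above can be sketched in Python (an illustration only, not the levitate implementation; the positional token pairing and the rule of taking the smaller weight of the two paired tokens are assumptions):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling dynamic-programming row."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[len(b)]

def weighted_token_ratio(a: str, b: str, weights=None) -> float:
    """Sketch of the weighted token ratio described in Details."""
    weights = weights or {}
    num = den = 0.0
    # Pair tokens positionally (an assumption; the real pairing
    # strategy in the package may differ).
    for ta, tb in zip(a.split(), b.split()):
        dist = edit_distance(ta, tb)
        max_dist = max(len(ta), len(tb))   # maximum possible edit distance
        # Apply the smaller of the two tokens' weights (assumption);
        # unweighted tokens default to 1.
        w = min(weights.get(ta, 1.0), weights.get(tb, 1.0))
        num += w * dist
        den += w * max_dist
    return 1 - num / den if den else 1.0
```

Down-weighting a shared, uninformative token such as "ltd" shrinks its contribution to both sums, so the remaining (informative) tokens dominate the score, which is exactly the behaviour motivated in the Description.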
See Also
Other weighted token functions:
lev_weighted_token_set_ratio(),
lev_weighted_token_sort_ratio()
Examples
lev_weighted_token_ratio("jim ltd", "tim ltd")
lev_weighted_token_ratio("tim ltd", "jim ltd", weights = list(ltd = 0.1))