R: Value merging based on ngram fingerprints

n_gram_merge {refinr}

R Documentation

Value merging based on ngram fingerprints

Description

This function takes a character vector and edits and merges values that are approximately equivalent yet not identical. It uses a two-step process. The first step clusters values based on their ngram fingerprint (described at https://openrefine.org/docs/technical-reference/clustering-in-depth). The second step merges values based on approximate string matching of the ngram fingerprints, using the [sd_lower_tri()] C function from the package stringdist.
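To make the first step concrete, here is a minimal base R sketch of an ngram fingerprint, loosely following the steps in the OpenRefine write-up (lowercase, strip punctuation and whitespace, collect the ngrams, deduplicate, sort, concatenate). The helper name `ngram_fingerprint` is hypothetical and is not the function refinr uses internally. The sketch also simplifies by dropping non-ASCII characters instead of normalizing them.

```r
# Hypothetical helper, NOT part of refinr: an illustrative ngram fingerprint.
ngram_fingerprint <- function(x, numgram = 2) {
  # Lowercase, then keep only ASCII letters and digits. This removes
  # punctuation, whitespace and control characters (and, as a
  # simplification, any non-ASCII characters).
  clean <- gsub("[^a-z0-9]", "", tolower(x))
  vapply(clean, function(s) {
    n_chars <- nchar(s)
    # Strings shorter than one ngram are returned as they are.
    if (n_chars < numgram) return(s)
    starts <- seq_len(n_chars - numgram + 1)
    grams <- substring(s, starts, starts + numgram - 1)
    # Deduplicate, sort (C-locale order via radix), and concatenate.
    paste(sort(unique(grams), method = "radix"), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

fingerprints <- ngram_fingerprint(c("Acme Pizza", "acme  PIZZA!"))
fingerprints[1] == fingerprints[2]  # TRUE: case and punctuation are ignored
```

Values that share a fingerprint would fall into the same cluster. Values whose fingerprints differ slightly are left for the approximate matching step.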

Usage

n_gram_merge(
  vect,
  numgram = 2,
  ignore_strings = NULL,
  bus_suffix = TRUE,
  edit_threshold = 1,
  weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5),
  ...
)

Arguments

`vect`	Character vector, items to be potentially clustered and merged.
`numgram`	Numeric value, indicating the number of characters that will occupy each ngram token. Default value is 2.
`ignore_strings`	Character vector, these strings will be ignored during the merging of values within `vect`. Default value is NULL.
`bus_suffix`	Logical, indicating whether the merging of records should be insensitive to common business suffixes or not. Default value is TRUE.
`edit_threshold`	Numeric value, indicating the threshold at which a merge is performed, based on the sum of the edit values derived from param `weight`. Default value is 1. If this parameter is set to 0 or NA, then no approximate string matching will be done, and all merging will be based on strings that have identical ngram fingerprints.
`weight`	Numeric vector, indicating the weights to assign to the four edit operations (see details below), for the purpose of approximate string matching. Default values are c(d = 0.33, i = 0.33, s = 1, t = 0.5). This parameter gets passed along to the `stringdist` function. Must be either a numeric vector of length four, or NA.
`...`	Additional args to be passed along to the `stringdist` function. The acceptable args are identical to those of [stringdistmatrix()].

Details

The values of arg `weight` are edit penalties that get passed to the stringdist edit distance function. The vector has four elements, one for each type of edit, each with a default penalty value:

d: deletion, default value is 0.33
i: insertion, default value is 0.33
s: substitution, default value is 1
t: transposition, default value is 0.5
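The effect of these penalties can be sketched with a small weighted edit distance in base R. The function `weighted_osa` below is a hypothetical illustration, assuming an optimal-string-alignment style distance. It is not the stringdist implementation, and in `n_gram_merge` the distance is computed between ngram fingerprints rather than raw strings.

```r
# Hypothetical helper, NOT part of refinr or stringdist: weighted edit
# distance with separate penalties for deletion (d), insertion (i),
# substitution (s) and transposition of adjacent characters (t).
# `weight` must be a named vector with elements d, i, s and t.
weighted_osa <- function(a, b,
                         weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  nx <- length(x)
  ny <- length(y)
  # cost[i + 1, j + 1]: cheapest way to turn the first i characters of `a`
  # into the first j characters of `b`.
  cost <- matrix(0, nx + 1, ny + 1)
  cost[, 1] <- (0:nx) * weight[["d"]]
  cost[1, ] <- (0:ny) * weight[["i"]]
  for (i in seq_len(nx)) {
    for (j in seq_len(ny)) {
      sub_cost <- if (x[i] == y[j]) 0 else weight[["s"]]
      best <- min(cost[i, j + 1] + weight[["d"]],  # delete x[i]
                  cost[i + 1, j] + weight[["i"]],  # insert y[j]
                  cost[i, j] + sub_cost)           # match or substitute
      # Swap of two adjacent characters.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, cost[i - 1, j - 1] + weight[["t"]])
      }
      cost[i + 1, j + 1] <- best
    }
  }
  cost[nx + 1, ny + 1]
}

weighted_osa("piza", "pizza")   # one insertion
weighted_osa("pizza", "pizaz")  # one transposition
# Raising the insertion penalty makes the distance asymmetric:
w <- c(d = 0.4, i = 1, s = 1, t = 1)
weighted_osa("pizza", "piza", weight = w)  # one deletion
weighted_osa("piza", "pizza", weight = w)  # one insertion
```

Note that with the default penalties a deletion followed by an insertion (0.66 in total) is cheaper than a single substitution (1), so a lone substitution never contributes more than 0.66 to the distance in this sketch.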

Value

Character vector with similar values merged.

Examples

x <- c("Acme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")

n_gram_merge(vect = x)

# The performance of the approximate string matching can be adjusted using
# parameters 'weight' or 'edit_threshold'
n_gram_merge(vect = x,
             weight = c(d = 0.4, i = 1, s = 1, t = 1))

# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high",
       "high school, bakersfield")
n_gram_merge(vect = x, ignore_strings = c("high", "school", "highschool"))

[Package refinr version 0.3.3 Index]