n_gram_merge {refinr} | R Documentation |
Value merging based on ngram fingerprints
Description
This function takes a character vector and edits and merges values
that are approximately equivalent yet not identical. It uses a two-step
process. The first step clusters values based on their ngram fingerprint (described at
https://openrefine.org/docs/technical-reference/clustering-in-depth).
The second step merges values based on approximate string matching of
the ngram fingerprints, using the sd_lower_tri() C function from the
package stringdist.
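To make the first step concrete, here is a minimal base R sketch of an ngram fingerprint, loosely following the OpenRefine description linked above. The helper name ngram_fingerprint is hypothetical and not part of refinr, and the cleaning rule (keep only ASCII letters and digits) is a simplifying assumption. The package's own implementation may normalize strings differently.

```r
# Hypothetical helper, not part of refinr: a simplified ngram fingerprint.
ngram_fingerprint <- function(x, n = 2) {
  # Assumption: lowercase, then drop everything except ASCII letters/digits.
  clean <- gsub("[^a-z0-9]", "", tolower(x))
  vapply(clean, function(s) {
    len <- nchar(s)
    # Strings shorter than one ngram are returned unchanged.
    if (len < n) return(s)
    starts <- seq_len(len - n + 1)
    grams <- substring(s, starts, starts + n - 1)
    # Deduplicate, sort (locale-independent), and join the ngrams.
    paste(sort(unique(grams), method = "radix"), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Values that differ only in case, spacing, or punctuation collapse
# to the same fingerprint and would fall into the same cluster.
ngram_fingerprint(c("Acme Pizza", "ACME  PIZZA!"))
```

Because the ngrams are deduplicated and sorted, the fingerprint ignores word order and repeated character pairs, which is what lets differently formatted values land in the same cluster.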
Usage
n_gram_merge(
  vect,
  numgram = 2,
  ignore_strings = NULL,
  bus_suffix = TRUE,
  edit_threshold = 1,
  weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5),
  ...
)
Arguments
vect |
Character vector, items to be potentially clustered and merged. |
numgram |
Numeric value, indicating the number of characters that will occupy each ngram token. Default value is 2. |
ignore_strings |
Character vector, these strings will be ignored during
the merging of values within vect. |
bus_suffix |
Logical, indicating whether the merging of records should be insensitive to common business suffixes or not. Default value is TRUE. |
edit_threshold |
Numeric value, indicating the threshold at which a
merge is performed, based on the sum of the edit values derived from
param weight. |
weight |
Numeric vector, indicating the weights to assign to
the four edit operations (see details below), for the purpose of
approximate string matching. Default values are
c(d = 0.33, i = 0.33, s = 1, t = 0.5). This parameter gets passed along
to the stringdist function. |
... |
Additional args to be passed along to the stringdist function. |
Details
The values of arg weight are edit distance penalties that get passed to
the stringdist edit distance function. The param takes four named values,
one for each type of edit, each with a default penalty value:
d: deletion, default value is 0.33
i: insertion, default value is 0.33
s: substitution, default value is 1
t: transposition, default value is 0.5
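As an illustration of how these four penalties combine, the following base R sketch computes a weighted optimal string alignment distance. The helper weighted_osa is hypothetical and not part of refinr or stringdist. It assumes that a deletion removes a character from the first string and an insertion adds one, and that weight is a named vector as in the defaults above. The stringdist implementation may differ in detail.

```r
# Hypothetical helper, not part of refinr or stringdist:
# weighted optimal string alignment (OSA) distance between two strings.
weighted_osa <- function(a, b,
                         weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  m <- length(x)
  n <- length(y)
  # D[i + 1, j + 1] is the cost of turning the first i characters of a
  # into the first j characters of b.
  D <- matrix(0, m + 1, n + 1)
  D[, 1] <- (0:m) * weight[["d"]]
  D[1, ] <- (0:n) * weight[["i"]]
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      sub_cost <- if (x[i] == y[j]) 0 else weight[["s"]]
      best <- min(
        D[i, j + 1] + weight[["d"]],  # delete x[i]
        D[i + 1, j] + weight[["i"]],  # insert y[j]
        D[i, j] + sub_cost            # substitute, or match at no cost
      )
      # Transposition of two adjacent characters.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, D[i - 1, j - 1] + weight[["t"]])
      }
      D[i + 1, j + 1] <- best
    }
  }
  D[m + 1, n + 1]
}

weighted_osa("pizza", "piza")  # one deletion
weighted_osa("acme", "amce")   # one transposition
```

With the default weights, a deletion followed by an insertion (0.66) is cheaper than a single substitution (1), so the lowest-cost edit path is not always the one with the fewest edits. In n_gram_merge the distance is taken between ngram fingerprints rather than raw strings, and the result is compared against edit_threshold.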
Value
Character vector with similar values merged.
Examples
x <- c("Acme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")
n_gram_merge(vect = x)
# The performance of the approximate string matching can be adjusted using
# parameters 'weight' or 'edit_threshold'
n_gram_merge(vect = x,
             weight = c(d = 0.4, i = 1, s = 1, t = 1))
# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high",
       "high school, bakersfield")
n_gram_merge(vect = x, ignore_strings = c("high", "school", "highschool"))