| pair_minsim {reclin2} | R Documentation |
Generate pairs with a minimal similarity
Description
Generates all pairs of records from x and y that have a similarity score of
at least minsim and, when on_blocking is given, equal blocking variables.
Usage
pair_minsim(
x,
y,
on,
minsim = 0,
on_blocking = character(0),
comparators = list(default_comparator),
default_comparator = cmp_identical(),
keep_simsum = TRUE,
deduplication = FALSE,
add_xy = TRUE
)
Arguments
x |
first data.frame. |
y |
second data.frame. |
on |
the variables on which the pairs of records from x and y are compared. |
minsim |
minimal similarity score. |
on_blocking |
variables for which the pairs have to match. |
comparators |
named list of functions with which the variables are compared.
Each function should accept two vectors and should return either a vector
or a |
default_comparator |
comparison function used for variables for which no comparison function is
defined in comparators. |
keep_simsum |
add a variable |
deduplication |
generate pairs from x only, for deduplication of x. |
add_xy |
add |
Details
Generating (all) pairs of the records of two data sets is usually the first
step when linking the two data sets. However, this often results in too many
pairs. pair_minsim keeps only the pairs with a
similarity score equal to or larger than minsim. The similarity score is
calculated by summing the results of the comparators over all variables
in on.
Missing values in the variables on which the pairs are compared count as a similarity of 0.
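The scoring rule described above can be sketched in base R. This is only an illustration of the rule, not the code reclin2 uses internally; the toy records, the helper same() and the column name score are hypothetical.

```r
# Hypothetical toy records; not the linkexample data shipped with reclin2
x <- data.frame(lastname = c("Smith", "Jones"), postcode = c("1234", NA),
  stringsAsFactors = FALSE)
y <- data.frame(lastname = c("Smith", "Smith"), postcode = c("1234", "5678"),
  stringsAsFactors = FALSE)
on <- c("lastname", "postcode")

# All combinations of row numbers from x and y
pairs <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(y)))

# Exact-match comparison; a missing value on either side counts as 0
same <- function(a, b) {
  res <- as.numeric(a == b)
  res[is.na(res)] <- 0
  res
}

# Sum the comparison results over all variables in `on`
pairs$score <- 0
for (v in on) {
  pairs$score <- pairs$score + same(x[[v]][pairs$.x], y[[v]][pairs$.y])
}

# Keep only the pairs that reach the minimal similarity
minsim <- 1
kept <- pairs[pairs$score >= minsim, ]
```

Here the second record of x has a missing postcode, so it can score at most 1 (on lastname); since its lastname does not match either, both of its pairs are dropped.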
Value
A data.table with the columns
.x and .y is returned. Columns .x and .y are
row numbers from data.frames x and y respectively.
See Also
pair and pair_blocking are other methods
to generate pairs.
Examples
data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2,
  on = c("postcode", "address"), minsim = 1)
# Either address or postcode has to match to keep a pair
data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2, on_blocking = "postcode",
  on = c("lastname", "firstname", "address"), minsim = 2)
# Postcode has to match; at least two of lastname, firstname and address
# have to match (i.e. one mismatch is allowed).
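The comparators argument takes a named list of functions that each accept two vectors. A minimal sketch of supplying one's own function for a single variable follows; cmp_nocase is a hypothetical helper written for this example and is not part of reclin2. Variables without an entry in the list fall back to default_comparator. The call to pair_minsim is guarded so the block also runs where reclin2 is not installed.

```r
# Hypothetical comparator: case-insensitive exact match.
# Takes two vectors of equal length and returns a numeric vector.
cmp_nocase <- function(x, y) {
  as.numeric(tolower(x) == tolower(y))
}

# Stand-alone check of the comparator on toy values
res <- cmp_nocase(c("Smith", "JONES", NA), c("smith", "Jonas", "Rood"))

# Use it for lastname only; postcode uses the default comparator
if (requireNamespace("reclin2", quietly = TRUE)) {
  library(reclin2)
  data("linkexample1", "linkexample2")
  pairs <- pair_minsim(linkexample1, linkexample2,
    on = c("lastname", "postcode"),
    comparators = list(lastname = cmp_nocase), minsim = 1)
}
```

The comparator returns a missing value when either input is missing; as noted in Details, such values count as a similarity of 0.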