pair_minsim {reclin2} | R Documentation |
Generate pairs with a minimal similarity
Description
Generates all combinations of records from x and y whose similarity score
is at least minsim. When on_blocking is given, the blocking variables of a
pair must also be equal.
Usage
pair_minsim(
  x,
  y,
  on,
  minsim = 0,
  on_blocking = character(0),
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  keep_simsum = TRUE,
  deduplication = FALSE,
  add_xy = TRUE
)
Arguments
x | first |
y | second |
on | the variables defining on which the pairs of records from |
minsim | minimal similarity score. |
on_blocking | variables for which the pairs have to match. |
comparators | named list of functions with which the variables are compared. Each function should accept two vectors and should return either a vector or a |
default_comparator | variables for which no comparison function is defined using |
keep_simsum | add a variable |
deduplication | generate pairs from only |
add_xy | add |
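As an illustration of the kind of function the comparators argument expects, the following is a minimal sketch of a custom comparator in base R. The name cmp_ignore_case and the sample values are hypothetical and not part of reclin2. Returning a logical vector is an assumption based on the description above ("accept two vectors", "return a vector").

```r
# Hypothetical comparator (not part of reclin2): compares two character
# vectors element by element, ignoring case and surrounding whitespace.
cmp_ignore_case <- function(a, b) {
  same <- tolower(trimws(a)) == tolower(trimws(b))
  same[is.na(same)] <- FALSE  # assumption: a missing value is a non-match
  same
}

# One logical value per pair of elements.
res <- cmp_ignore_case(c("Smith", " Jones", NA), c("smith", "jones", "Brown"))
print(res)

# Assumed usage, following the `comparators` argument described above:
# pair_minsim(x, y, on = c("lastname", "postcode"), minsim = 1,
#             comparators = list(lastname = cmp_ignore_case))
```

Variables in on that have no entry in the list would, per the argument description, fall back to default_comparator.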
Details
Generating all pairs of the records of two data sets is usually the first
step when linking the two data sets. However, this often results in too
many pairs. pair_minsim only keeps pairs with a similarity score equal to
or larger than minsim. The similarity score is calculated by summing the
results of the comparators for all variables in on.
Missing values in the variables on which the pairs are compared count as a similarity of 0.
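The scoring rule described above can be sketched in base R. This is not the package's implementation. The data, the helper compare_exact and the use of expand.grid are assumptions made for illustration only, with a simple exact-match comparison standing in for the default comparator.

```r
# Made-up records for the sketch.
x <- data.frame(postcode = c("6789 XY", "1234 AB"),
                address  = c("12 Mainroad", NA),
                stringsAsFactors = FALSE)
y <- data.frame(postcode = c("6789 XY", "1234 AB"),
                address  = c("12 Mainroad", "1 Eaststr"),
                stringsAsFactors = FALSE)

# Hypothetical exact-match comparator; a missing value counts as 0 (FALSE).
compare_exact <- function(a, b) {
  same <- a == b
  same[is.na(same)] <- FALSE
  same
}

# All candidate pairs, stored as row numbers into x and y.
pairs <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(y)))

# Compare each variable in `on` and sum the results per pair.
on <- c("postcode", "address")
scores <- vapply(on, function(v) {
  compare_exact(x[[v]][pairs$.x], y[[v]][pairs$.y])
}, logical(nrow(pairs)))
simsum <- rowSums(scores)

# Keep only the pairs whose score reaches the threshold.
minsim <- 1
kept <- pairs[simsum >= minsim, ]
print(kept)
```

In this sketch the pair of second records agrees on postcode but has a missing address in x, so the address contributes 0 and the pair is kept with a score of exactly 1.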
Value
A data.table with two columns, .x and .y, is returned. Columns .x and .y
are row numbers from data.frames x and y respectively.
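Because .x and .y are row numbers, the records behind each pair can be looked up by plain indexing. The sketch below uses a hand-made data.frame as a stand-in for the returned pairs. The data and the column names of linked are hypothetical.

```r
# Made-up input data sets.
x <- data.frame(id = c(10, 11, 12),
                lastname = c("Smith", "Jones", "Brown"),
                stringsAsFactors = FALSE)
y <- data.frame(id = c(20, 21),
                lastname = c("Jones", "Smith"),
                stringsAsFactors = FALSE)

# Hypothetical stand-in for a result with the structure described above.
pairs <- data.frame(.x = c(1L, 2L), .y = c(2L, 1L))

# Look up the records of each pair via the row numbers.
linked <- data.frame(
  x_id       = x$id[pairs$.x],
  y_id       = y$id[pairs$.y],
  x_lastname = x$lastname[pairs$.x],
  y_lastname = y$lastname[pairs$.y],
  stringsAsFactors = FALSE
)
print(linked)
```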
See Also
pair and pair_blocking are other methods to generate pairs.
Examples
data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2,
  on = c("postcode", "address"), minsim = 1)
# Either address or postcode has to match to keep a pair

pairs <- pair_minsim(linkexample1, linkexample2, on_blocking = "postcode",
  on = c("lastname", "firstname", "address"), minsim = 2)
# Postcode has to match; at least two of lastname, firstname and address
# also have to match (i.e. one mismatch is allowed).