| cluster_pair_minsim {reclin2} | R Documentation |
Generate pairs with a minimal similarity using multiple processes
Description
Generates all pairs of records from x and y that have a similarity
score of at least minsim, using multiple R processes.
Usage
cluster_pair_minsim(
  cluster,
  x,
  y,
  on,
  minsim = 0,
  on_blocking = character(0),
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  keep_simsum = TRUE,
  deduplication = FALSE,
  name = "default"
)
Arguments
cluster |
a cluster object as created by |
x |
first |
y |
second |
on |
the variables on which the pairs of records from x and y are compared;
the similarity score is the sum of the comparison results for these variables. |
minsim |
minimal similarity score. |
on_blocking |
variables for which the pairs have to match. |
comparators |
named list of functions with which the variables are compared.
Each function should accept two vectors and should return either a vector
or a |
default_comparator |
variables for which no comparison function is defined using
|
keep_simsum |
add a variable |
deduplication |
generate pairs from only |
name |
the name of the resulting object to create locally on the different R processes. |
Details
Generating (all) pairs of the records of two data sets is usually the first
step when linking the two data sets. However, this often results in too
many pairs. pair_minsim keeps only the pairs with a similarity score
greater than or equal to minsim. The similarity score is
calculated by summing the results of the comparators for all variables
in on.
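The scoring rule described above can be sketched in base R. This is a naive illustration under stated assumptions, not the package's implementation: the toy data, the column names .x, .y and simsum, and the use of an exact-match comparison (1 when identical, 0 otherwise) are all chosen here for illustration.

```r
# Toy records (hypothetical data, for illustration only)
x <- data.frame(id = 1:3,
  postcode = c("1234", "1234", "5678"),
  address  = c("Main St 1", "High St 2", "Low Rd 3"),
  stringsAsFactors = FALSE)
y <- data.frame(id = 1:2,
  postcode = c("1234", "5678"),
  address  = c("Main St 1", "Other Rd 9"),
  stringsAsFactors = FALSE)

on <- c("postcode", "address")
minsim <- 1

# All combinations of row indices of x and y
pairs <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(y)))

# Similarity score: sum over the variables in 'on' of an exact-match
# comparison (1 when the values are identical, 0 otherwise)
pairs$simsum <- 0
for (v in on) {
  pairs$simsum <- pairs$simsum +
    as.numeric(x[[v]][pairs$.x] == y[[v]][pairs$.y])
}

# Keep only the pairs whose score is at least minsim
kept <- pairs[pairs$simsum >= minsim, ]
```

With minsim = 1 and two variables in on, a pair is kept as soon as one of the two variables matches; raising minsim to 2 would require both to match.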
x is split into length(cluster) parts, which are distributed
over the worker nodes. y is copied to each of the nodes. pair_minsim is
then called on each node. The pairs are stored on the nodes in the global
object reclin_env, in a variable with the name given by name. The pairs
can then be further processed using functions such as
compare_pairs and tabulate_patterns. The function
cluster_collect collects the pairs from each of the nodes.
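The distribution pattern described above (split x, copy y, keep the result on each node under a name, collect afterwards) can be sketched with the parallel package alone. This is a rough sketch under stated assumptions and not how reclin2 is implemented: the environment toy_env is a hypothetical stand-in for reclin_env, the way x is split is one arbitrary choice, and a plain merge on postcode stands in for the call to pair_minsim on each node.

```r
library(parallel)

# Toy data (hypothetical), standing in for x and y
x <- data.frame(id = 1:6, postcode = c("A", "A", "B", "B", "C", "C"))
y <- data.frame(id = 1:3, postcode = c("A", "B", "D"))

cl <- makeCluster(2)

# Split the rows of x into length(cl) parts (one arbitrary way to split)
part <- rep(seq_along(cl), length.out = nrow(x))
x_parts <- split(x, part)

# Each worker receives one part of x and a full copy of y, and stores its
# result under 'name' in an environment that lives on that worker
invisible(clusterApply(cl, x_parts, function(xi, y, name) {
  if (!exists("toy_env", envir = globalenv(), inherits = FALSE)) {
    assign("toy_env", new.env(), envir = globalenv())
  }
  # Stand-in for the pair generation step on the node
  pairs <- merge(xi, y, by = "postcode", suffixes = c(".x", ".y"))
  assign(name, pairs, envir = get("toy_env", envir = globalenv()))
  NULL
}, y = y, name = "default"))

# Collect the per-worker results and combine them into one data.frame
collected <- clusterCall(cl, function(name) {
  get(name, envir = get("toy_env", envir = globalenv()))
}, "default")
all_pairs <- do.call(rbind, collected)

stopCluster(cl)
```

Keeping the intermediate result on the workers means that only the name has to be passed around between steps; the data itself moves back to the main process only when it is collected.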
Value
An object of type cluster_pairs, which is a list containing the
cluster and the name of the pairs object on the cluster nodes. For the pairs
objects created on the nodes see the documentation of pair.
See Also
cluster_pair and cluster_pair_blocking are
other methods to generate pairs.
Examples
library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)
# Either address or postcode has to match to keep a pair
pairs <- cluster_pair_minsim(cl, linkexample1, linkexample2,
  on = c("postcode", "address"), minsim = 1)
stopCluster(cl)