pair_minsim {reclin2}    R Documentation

Generate pairs with a minimal similarity

Description

Generates all combinations of records from x and y whose similarity score is at least minsim and, when on_blocking is given, whose blocking variables are equal.

Usage

pair_minsim(
  x,
  y,
  on,
  minsim = 0,
  on_blocking = character(0),
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  keep_simsum = TRUE,
  deduplication = FALSE,
  add_xy = TRUE
)

Arguments

x

first data.frame

y

second data.frame. Ignored when deduplication = TRUE.

on

the variables on which the pairs of records from x and y are compared.

minsim

minimal similarity score.

on_blocking

variables for which the pairs have to match.

comparators

named list of functions with which the variables are compared. Each function should accept two vectors and should return either a vector or a data.table with multiple columns.

default_comparator

variables for which no comparison function is defined using comparators are compared with the function default_comparator.

keep_simsum

add a variable simsum to the result containing the similarity score of the pair.

deduplication

generate pairs from x only and ignore y. This is useful for deduplication of x.

add_xy

add x and y as attributes to the returned pairs. This makes calling some subsequent operations that need x and y (such as compare_pairs) easier.

Details

Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets. However, this often results in too many pairs. pair_minsim only keeps pairs with a similarity score equal to or larger than minsim. The similarity score is calculated by summing the results of the comparators for all variables of on.

Missing values in the variables on which the pairs are compared count as a similarity of 0.
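
The scoring rule described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the toy data frames and the helper identical_score are hypothetical, and an exact-match comparison stands in for the default comparator.

```r
# Hypothetical toy records (not shipped with reclin2)
x <- data.frame(postcode = c("6789 XY", NA),
                address  = c("12 Mainroad", "61 Mainstreet"),
                stringsAsFactors = FALSE)
y <- data.frame(postcode = c("6789 XY", "1234 AB"),
                address  = c("12 Mainrd", "61 Mainstreet"),
                stringsAsFactors = FALSE)

on     <- c("postcode", "address")
minsim <- 1

# All combinations of row numbers from x and y
pairs <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(y)))

# Illustrative exact-match comparator: 1 if equal, 0 otherwise;
# a missing value on either side counts as a similarity of 0
identical_score <- function(a, b) {
  s <- as.numeric(a == b)
  s[is.na(s)] <- 0
  s
}

# Sum the comparator results over all variables in `on`
pairs$simsum <- 0
for (v in on) {
  pairs$simsum <- pairs$simsum +
    identical_score(x[[v]][pairs$.x], y[[v]][pairs$.y])
}

# Keep only pairs whose summed score reaches the threshold
kept <- pairs[pairs$simsum >= minsim, ]
print(kept)
```

In this sketch the pair (1, 1) is kept because the postcodes agree, and the pair (2, 2) is kept because the addresses agree even though the postcode of the second record in x is missing.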

Value

A data.table with two columns, .x and .y, is returned. Columns .x and .y are row numbers from data.frames x and y respectively. When keep_simsum = TRUE, the similarity score of each pair is included as an additional column.
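
Because .x and .y are row numbers, they can be used directly to look up the records behind each pair. A minimal base R sketch, assuming a hypothetical result with the same two columns (the data frames and the pairs object below are made up for illustration and are not produced by pair_minsim):

```r
# Hypothetical input tables
x <- data.frame(name = c("Smith", "Jones", "Brown"), stringsAsFactors = FALSE)
y <- data.frame(name = c("Braun", "Smith"), stringsAsFactors = FALSE)

# Hypothetical pairs, shaped like the result described above
pairs <- data.frame(.x = c(1, 3), .y = c(2, 1))

# Index each table with the row numbers stored in the pairs
matched <- data.frame(
  x_name = x$name[pairs$.x],
  y_name = y$name[pairs$.y],
  stringsAsFactors = FALSE
)
print(matched)
```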

See Also

pair and pair_blocking are other methods to generate pairs.

Examples

data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2, 
   on = c("postcode", "address"), minsim = 1)
# Either address or postcode has to match to keep a pair

data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2, on_blocking = "postcode",
   on = c("lastname", "firstname", "address"), minsim = 2)
# Postcode has to match; of lastname, firstname and address at least two
# have to match (i.e. one mismatch is allowed).
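
The following sketch shows how the comparators and deduplication arguments might be used. It assumes that reclin2 is installed and that cmp_jarowinkler() returns a similarity between 0 and 1 when given two vectors. The thresholds 1.8 and 1 are arbitrary illustrative choices.

```r
library(reclin2)
data("linkexample1", "linkexample2")

# Compare lastname with a string similarity; postcode falls back to the
# default comparator. With minsim = 1.8 the postcode has to match and the
# last names have to be similar (threshold chosen for illustration only).
pairs_jw <- pair_minsim(linkexample1, linkexample2,
   on = c("lastname", "postcode"),
   comparators = list(lastname = cmp_jarowinkler()),
   minsim = 1.8)
print(pairs_jw)

# Deduplication: pairs are generated within linkexample1 only, so y is
# not passed.
pairs_dedup <- pair_minsim(linkexample1,
   on = c("lastname", "postcode"), minsim = 1, deduplication = TRUE)
print(pairs_dedup)
```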


[Package reclin2 version 0.5.0 Index]