R: Deselect pairs that are linked to multiple records

select_unique.cluster_pairs {reclin2}

R Documentation

Deselect pairs that are linked to multiple records

Description

Deselect pairs that are linked to multiple records

Usage

## S3 method for class 'cluster_pairs'
select_unique(
  pairs,
  variable,
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  ...
)

select_unique(
  pairs,
  variable,
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  ...
)

## S3 method for class 'pairs'
select_unique(
  pairs,
  variable,
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  inplace = FALSE,
  ...
)

Arguments

`pairs`	a `pairs` object, such as generated by `pair_blocking`
`variable`	the name of the new variable to create in pairs. This will be a logical variable with a value of `TRUE` for the selected pairs.
`preselect`	a logical variable with the same length as `pairs` has rows, or the name of such a variable in `pairs`. Pairs are only selected when `preselect` is `TRUE`.
`n`	do not select pairs with a y-record that is linked to more than `n` records.
`m`	do not select pairs with an x-record that is linked to more than `m` records.
`id_x`	an integer vector with the same length as the number of rows in `pairs`, or the name of a column in `x`. This vector should identify unique objects in `x`. When not specified it is assumed that each element in `x` is unique.
`id_y`	an integer vector with the same length as the number of rows in `pairs`, or the name of a column in `y`. This vector should identify unique objects in `y`. When not specified it is assumed that each element in `y` is unique.
`...`	Used to pass additional arguments to methods
`x`	`data.table` with one half of the pairs.
`y`	`data.table` with the other half of the pairs.
`inplace`	logical indicating whether `pairs` should be modified in place. When pairs is large this can be more efficient.

Details

This function can be used to remove pairs for which there is ambiguity. For example, when a record from x is linked to two records from y and we know that there are no duplicate records in y (records that belong to the same object), then at least one of the two links is incorrect, but we cannot decide which. In that case we may prefer to accept neither link. Running select_unique with m = 1 will deselect both pairs.
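
The rule described above can be illustrated with a small base R sketch. It is not the reclin2 implementation: the data frame `links` and the helper `deselect_ambiguous` are hypothetical names introduced for illustration. The sketch also assumes that only preselected pairs count towards the number of links of a record, and that `m` limits x-records while `n` limits y-records.

```r
# Hypothetical candidate links: each row pairs a record in x with a record in y.
links <- data.frame(
  x = c(1, 1, 2, 3),
  y = c(1, 2, 3, 3),
  select = c(TRUE, TRUE, TRUE, FALSE)
)

# Illustrative helper (not part of reclin2): keep a preselected pair only if
# its x-record occurs in at most m preselected pairs and its y-record in at
# most n preselected pairs.
deselect_ambiguous <- function(links, preselect, n = 1, m = 1) {
  sel <- links[[preselect]]
  # For each row, count the preselected pairs sharing the same x or y record.
  n_x <- ave(as.integer(sel), links$x, FUN = sum)
  n_y <- ave(as.integer(sel), links$y, FUN = sum)
  sel & n_x <= m & n_y <= n
}

links$select_unique <- deselect_ambiguous(links, "select")
print(links)
```

Here record 1 from x is linked to two records from y, so both of its pairs are deselected. The pair (2, 3) is kept because the other pair involving y-record 3 was not preselected.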

If one instead wants to select one of the records at random, select_greedy can be used: it selects the pair with the highest weight and, in case of equal weights, the first pair. Adding a random component to the weights makes that choice random.
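
One way to add such a random component is sketched below in base R. The data frame `links` and the column `score_random` are hypothetical. The noise is bounded by half the smallest gap between distinct scores, which is an assumption of this sketch so that only the order of tied pairs changes.

```r
set.seed(1)  # only to make the sketch reproducible

# Hypothetical scores: record x = 1 has two candidate links with equal weight.
links <- data.frame(x = c(1, 1, 2), y = c(1, 2, 3), score = c(5, 5, 4))

# Smallest difference between distinct scores (assumes at least two distinct
# score values are present).
gap <- min(diff(sort(unique(links$score))))

# Add noise smaller than that gap, so that pairs with a higher original
# score still rank above pairs with a lower one.
links$score_random <- links$score + runif(nrow(links), 0, gap / 2)

# Higher scores still come first; the order within a tie is now random
# instead of "first pair wins".
links[order(-links$score_random), ]
```

The perturbed column could then be used as the score passed to select_greedy. Check that function's documentation for its exact arguments.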

Value

Returns `pairs` with the variable named by `variable` added. This is a logical variable indicating which pairs are selected as matches.

Examples

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
  default_comparator = jaro_winkler(0.9), inplace = TRUE)
score_simple(pairs, "score",
  on = c("lastname", "firstname", "address", "sex"),
  w1 = list(lastname = 2), inplace = TRUE)
select_threshold(pairs, variable = "select",
  score = "score", threshold = 4.0, inplace = TRUE)
select_unique(pairs, variable = "select_unique", preselect = "select")

[Package reclin2 version 0.5.0 Index]