getPairs {RecordLinkage}    R Documentation

Extract Record Pairs

Description

Extracts record pairs from data and result objects.

Usage

## S4 method for signature 'RecLinkData'
getPairs(object, max.weight = Inf, min.weight = -Inf,
         single.rows = FALSE, show = "all", sort = !is.null(object$Wdata))

## S4 method for signature 'RLBigData'
getPairs(object, max.weight = Inf, min.weight = -Inf,
    filter.match = c("match", "unknown", "nonmatch"),
    withWeight = hasWeights(object), withMatch = TRUE, single.rows = FALSE,
    sort = withWeight)

## S4 method for signature 'RLResult'
getPairs(object, filter.match = c("match", "unknown", "nonmatch"),
    filter.link = c("nonlink", "possible", "link"), max.weight = Inf, 
    min.weight = -Inf, withMatch = TRUE, withClass = TRUE, 
    withWeight = hasWeights(object@data), single.rows = FALSE, sort = withWeight)

getFalsePos(object, single.rows = FALSE)
getFalseNeg(object, single.rows = FALSE)
getFalse(object, single.rows = FALSE)

Arguments

object

The data or result object from which to extract record pairs.

max.weight, min.weight

Real numbers. Upper and lower weight thresholds (both inclusive).

filter.match

Character vector, a nonempty subset of c("match", "nonmatch", "unknown") denoting which pairs to allow in the output.

filter.link

Character vector, a nonempty subset of c("link", "nonlink", "possible") denoting which pairs to allow in the output.

withWeight

Logical. Whether to include linkage weights in the output.

withMatch

Logical. Whether to include matching status in the output.

withClass

Logical. Whether to include classification result in the output.

single.rows

Logical. Whether to print record pairs in one row instead of two consecutive rows.

show

Character. Selects which records to show, one of "links", "nonlinks", "possible", "all".

sort

Logical. Whether to sort descending by weight.

Details

These methods extract record pairs from "RecLinkData", "RecLinkResult", "RLBigData" and "RLResult" objects. Possible applications are retrieving a linkage result for further processing, conducting a manual review to determine classification thresholds, or inspecting misclassified pairs.

The various arguments can be grouped by the following purposes:

  1. Controlling which record pairs are included in the output: min.weight and max.weight, filter.match, filter.link, show.

  2. Controlling which information is shown: withWeight, withMatch, withClass.

  3. Controlling the overall structure of the result: sort, single.rows.

The weight limits are inclusive, i.e. a record pair with weight w is included only if
w >= min.weight && w <= max.weight.
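
The inclusive limits can be illustrated with a minimal base R sketch. The weights below are made-up values, not output of the package, and the filter is written by hand rather than taken from the package source:

```r
# Hypothetical weights for five record pairs (made-up values).
w <- c(0.20, 0.50, 0.55, 0.60, 0.90)
min.weight <- 0.5
max.weight <- 0.6

# Inclusive on both ends: pairs whose weight equals a limit are kept.
keep <- w >= min.weight & w <= max.weight
kept_weights <- w[keep]
print(kept_weights)
```

Pairs with a weight exactly equal to min.weight or max.weight stay in the result, which matters when a classification threshold is reused as a limit.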

If single.rows is not TRUE, pairs are output on two consecutive lines in a more readable format. All data are converted to character, which can lead to a loss of precision for numeric values. Therefore, this format should be used for printing only.
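
The loss of precision can be reproduced in base R. This sketch only shows the general effect of converting a number to character and back; it is an assumption that the package's conversion behaves comparably, and the value x is arbitrary:

```r
# A numeric value that needs more than 15 significant digits.
x <- 1 / 3

# Round trip through character, similar in spirit to the two-row format.
x_chr  <- as.character(x)
x_back <- as.numeric(x_chr)

# The round trip is close to x but not exactly equal to it.
print(x_back == x)
print(abs(x_back - x))
```

For further numeric processing, extracting the pairs with single.rows = TRUE avoids relying on the printed representation.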

getFalsePos, getFalseNeg and getFalse are shortcuts (currently for objects of class "RLResult" only) to retrieve false positives (links that are non-matches in fact), false negatives (non-links that are matches in fact) or all falsely classified pairs, respectively.
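
The definitions of false positives and false negatives can be spelled out with plain logical vectors. This is an illustrative sketch on made-up data, not the package's implementation; the names is_match and is_link are hypothetical, and possible links are left out for simplicity:

```r
# Hypothetical true status and classification for six pairs (made-up data).
is_match <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
is_link  <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

# Links that are in fact non-matches.
false_pos <- which(is_link & !is_match)

# Non-links that are in fact matches.
false_neg <- which(!is_link & is_match)

# All falsely classified pairs.
all_false <- sort(c(false_pos, false_neg))
print(all_false)
```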

Value

A data frame. If single.rows is TRUE, each row holds (in this order) id and data fields of the first record, id and data fields of the second record and possibly matching status, classification result and/or weight.

If single.rows is not TRUE, the result holds for each resulting record pair consecutive rows of the following format:

  1. ID and data fields of the first record, followed by as many empty fields as needed to match the length of the following line.

  2. ID and data fields of the second record, possibly followed by matching status, classification result and/or weight.

  3. A blank line to separate record pairs.

Note

When non-matches are included in the output and blocking is permissive, the result object can be very large, possibly leading to memory problems.
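
One way to keep the result small is to block strictly and to restrict the weight range before extraction. The following is a sketch under assumptions: it uses the RLdata500 example data shipped with the package, and the blocking field and the threshold 0.7 are arbitrary choices for illustration, not recommended values:

```r
library(RecordLinkage)
data(RLdata500)

# Block on a single field so that fewer candidate pairs are generated.
rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                        blockfld = list(1))
rpairs <- epiWeights(rpairs)

# Extract only pairs above a weight limit instead of every pair.
high <- getPairs(rpairs, min.weight = 0.7, single.rows = TRUE)
print(nrow(high))
```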

Author(s)

Andreas Borg, Murat Sariyar

Examples

library(RecordLinkage)
data(RLdata500)

# create record pairs and calculate epilink weights
rpairs <- RLBigDataDedup(RLdata500, identity = identity.RLdata500,
  blockfld=list(1,3,5,6,7))
rpairs <- epiWeights(rpairs)

# show all record pairs with weights between 0.5 and 0.6
getPairs(rpairs, min.weight=0.5, max.weight=0.6)

# show only matches with weight <= 0.5
getPairs(rpairs, max.weight=0.5, filter.match="match")

# classify with one threshold
result <- epiClassify(rpairs, 0.5)

# show all links, do not show classification in the output
getPairs(result, filter.link="link", withClass = FALSE)

# see wrongly classified pairs
getFalsePos(result)
getFalseNeg(result)

[Package RecordLinkage version 0.4-12.4 Index]