R: Reduce Comparison Data Size

reduce_comparison_data {multilink}

R Documentation

Reduce Comparison Data Size

Description

Use indexing to reduce the number of record pairs that are potential matches.

Usage

reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)

Arguments

`comparison_list`	The output of a call to `create_comparison_data`.
`pairs_to_keep`	A `logical` vector, the same length as `comparison_list$record_pairs`, indicating which record pairs should be kept as potential matches. These potential matches do not have to be transitive (see the argument `cc`).
`cc`	A `numeric` indicator of whether to find the transitive closure of `pairs_to_keep`, and use these potential matches instead of just those from `pairs_to_keep`. `cc` should be `1` if the transitive closure is being used, and `cc` should be `0` if the transitive closure is not being used. We recommend setting `cc` to `1`.

Details

When using comparison-based record linkage methods, scalability is a concern, because the number of record pairs is quadratic in the number of records. To address this, it is common to declare a priori that certain record pairs are not potential matches, using indexing methods. The user is free to index using any method they like, as long as it yields a logical vector indicating which record pairs are potential matches.

By transitive we mean that, for any three records i, j, and k, if i and j are potential matches and j and k are potential matches, then i and k are also potential matches. If the user-chosen indexing method does not output transitive potential matches, we recommend setting the cc argument to 1. Non-transitive indexing schemes can lead to poor mixing of the Gibbs sampler used for posterior inference, and they suggest that the indexing method may have been too stringent.
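
To make the notion of transitive closure concrete, the following base R sketch closes a toy set of potential matches under transitivity by merging connected components. It is an illustration only and is not taken from the package: the helper name `transitive_closure` and the five-record toy data are assumptions made for this example, and the package's own implementation behind cc = 1 may differ.

```r
# Toy data (hypothetical): 5 records and all 10 record pairs, sorted by the
# first column and then by the second column.
record_pairs <- t(combn(5, 2))

# Keep (1, 2), (2, 3) and (4, 5). This is not transitive: (1, 3) is missing.
pairs_to_keep <- rep(FALSE, nrow(record_pairs))
pairs_to_keep[c(1, 5, 10)] <- TRUE

# Hypothetical helper: returns a logical vector marking every pair whose two
# records fall in the same connected component of the kept pairs.
transitive_closure <- function(record_pairs, pairs_to_keep) {
  n <- max(record_pairs)
  component <- seq_len(n)  # each record starts in its own component
  kept <- record_pairs[pairs_to_keep, , drop = FALSE]
  for (r in seq_len(nrow(kept))) {
    a <- component[kept[r, 1]]
    b <- component[kept[r, 2]]
    # Merge the two components by giving both the smaller label
    if (a != b) component[component == max(a, b)] <- min(a, b)
  }
  component[record_pairs[, 1]] == component[record_pairs[, 2]]
}

closed <- transitive_closure(record_pairs, pairs_to_keep)
record_pairs[closed, ]  # now also contains the pair (1, 3)
```

In this toy case the closure adds exactly one pair, (1, 3), because records 1, 2, and 3 form one connected component and records 4 and 5 form another.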

If indexing is used, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels for information on how to go between the original labeling and the new labeling.
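
The relabeling described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: the toy data, the ascending order of the new labels, and the column names `original` and `new` are all chosen for this example. Only the two-column layout (original label first, new label second) follows the description of labels in the Value section.

```r
# Toy data (hypothetical): 5 records and all 10 record pairs.
record_pairs <- t(combn(5, 2))

# Keep only (2, 4) and (4, 5), so records 1 and 3 have no potential matches.
pairs_to_keep <- rep(FALSE, nrow(record_pairs))
pairs_to_keep[c(6, 10)] <- TRUE

kept <- record_pairs[pairs_to_keep, , drop = FALSE]

# Records that appear in at least one kept pair survive the reduction
remaining <- sort(unique(as.vector(kept)))

# Dictionary between labelings: original label first, new label second
labels <- cbind(original = remaining, new = seq_along(remaining))

# Rewrite the kept pairs in terms of the new labels
relabeled_pairs <- cbind(match(kept[, 1], remaining),
                         match(kept[, 2], remaining))

# Map a new label back to the original labeling
new_label <- 3
original_label <- labels[match(new_label, labels[, "new"]), "original"]
```

Here records 2, 4, and 5 become records 1, 2, and 3, so the new label 3 maps back to the original record 5.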

Even when indexing is used, comparisons for record pairs that are not potential matches still contribute to inference: they inform the distribution of comparisons for non-matches.
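
As a rough illustration of how excluded pairs can still be summarized, the base R sketch below counts disagreement levels separately for kept and excluded record pairs. It is a simplification under stated assumptions: the comparison matrix and its column names are made up, and the counts are not split by file pair as they are in the ab and ab_not_included elements described in the Value section.

```r
# Hypothetical comparison matrix: 4 record pairs, one field with two
# disagreement levels (column names are invented for this sketch).
comparisons <- matrix(
  c(TRUE,  FALSE,
    FALSE, TRUE,
    FALSE, TRUE,
    TRUE,  FALSE),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("field_DL_1", "field_DL_2"))
)
pairs_to_keep <- c(TRUE, FALSE, FALSE, TRUE)

# Number of pairs at each disagreement level among the kept pairs
counts_kept <- colSums(comparisons[pairs_to_keep, , drop = FALSE])

# The same counts for the excluded pairs, which are summarized rather
# than discarded
counts_excluded <- colSums(comparisons[!pairs_to_keep, , drop = FALSE])
```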

Value

a list containing:

record_pairs: A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels).
comparisons: A logical matrix, where each row contains the comparisons between the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and each field is represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for that field based on breaks.
K: The number of files, assumed to be of class numeric.
file_sizes: A numeric vector of length K, indicating the size of each file. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes now represents the sizes of each file after removing such records.
duplicates: A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates.
field_levels: A numeric vector indicating the number of disagreement levels for each field.
file_labels: An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.
fp_matrix: An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp: A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.
ab: An integer vector of length ncol(comparisons) * K * (K + 1) / 2 that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
file_sizes_not_included: If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes_not_included indicates, for each file, the number of such records that were removed.
ab_not_included: An integer vector of length ncol(comparisons) * K * (K + 1) / 2 that indicates, for the record pairs not included according to pairs_to_keep, how many record pairs there are with a given disagreement level for a given field, for each file pair.
labels: If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled. labels provides a dictionary that indicates, for each of the new labels, which record in the original labeling the new label corresponds to. In particular, the first column indicates the record in the original labeling, and the second column indicates the new labeling.
pairs_to_keep: A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs were kept as potential matches. This may not be the same as the input pairs_to_keep if cc was set to 1.
cc: A numeric indicator of whether the connected components of the potential matches are closed under transitivity.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: 10.1080/01621459.2021.2013242

Examples

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- !comparison_list$comparisons[, "gname_DL_3"] &
 !comparison_list$comparisons[, "fname_DL_3"]
reduced_comparison_list <- reduce_comparison_data(comparison_list,
 pairs_to_keep, cc = 1)

[Package multilink version 0.1.1 Index]