| reduce_comparison_data {multilink} | R Documentation |
Reduce Comparison Data Size
Description
Use indexing to reduce the number of record pairs that are potential matches.
Usage
reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)
Arguments
comparison_list |
The output of a call to
|
pairs_to_keep |
A |
cc |
A |
Details
When using comparison-based record linkage methods, scalability is a concern,
as the number of record pairs is quadratic in the number of records. In
order to address these concerns, it's common to declare certain record pairs
to not be potential matches a priori, using indexing methods. The user is
free to index using any method they like, as long as they can produce a
logical vector that indicates which record pairs are potential matches
according to their indexing method. If the user's chosen indexing method
does not produce transitive potential matches, we recommend setting the
cc argument to 1. By transitive we mean that, for any three records
i, j, and k, if i and j are potential matches,
and j and k are potential matches, then i and k are
potential matches. Non-transitive indexing schemes can lead to poor mixing of
the Gibbs sampler used for posterior inference, and suggest that the
indexing method used may have been too stringent.
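The transitivity condition above can be checked on a toy example in base R. The following is only a sketch: component_labels is a hypothetical helper, not part of multilink, and whether cc = 1 performs exactly this closure internally is an assumption based on the description of the cc output.

```r
# Hypothetical helper (not part of multilink): label the connected components
# of the graph whose edges are the kept record pairs, using a simple union-find.
component_labels <- function(n_records, record_pairs, pairs_to_keep) {
  parent <- seq_len(n_records)
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }
  kept <- record_pairs[pairs_to_keep, , drop = FALSE]
  for (r in seq_len(nrow(kept))) {
    root_a <- find_root(kept[r, 1])
    root_b <- find_root(kept[r, 2])
    # Merge the two components under the smaller root label.
    if (root_a != root_b) parent[max(root_a, root_b)] <- min(root_a, root_b)
  }
  vapply(seq_len(n_records), find_root, numeric(1))
}

# Toy data: 5 records and all 10 pairs (i, j) with i < j.
record_pairs <- t(combn(5, 2))

# Keep (1, 2) and (2, 3) but not (1, 3), so the kept pairs are not transitive.
pairs_to_keep <- (record_pairs[, 1] == 1 & record_pairs[, 2] == 2) |
  (record_pairs[, 1] == 2 & record_pairs[, 2] == 3)

component <- component_labels(5, record_pairs, pairs_to_keep)

# Transitive closure: keep every pair whose two records share a component.
closed_pairs <- component[record_pairs[, 1]] == component[record_pairs[, 2]]

# The indexing is transitive only if the closure adds no new pairs.
is_transitive <- all(closed_pairs == pairs_to_keep)
```

Here the closure adds the pair (1, 3), so is_transitive is FALSE for the original pairs_to_keep.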
If indexing is used, it may be the case that some records are declared to not
be potential matches to any other records. In this case, the indexing method
has made the decision that these records have no matches, and thus we can
remove them from the data set and relabel the remaining records; see the
documentation for labels for information on how to go between the
original labeling and the new labeling.
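As a rough illustration of the relabeling described above, the sketch below builds a two-column dictionary in base R and uses it to translate in both directions. The names kept_pairs and label_map are hypothetical, and the assumption that new labels preserve the original ordering is ours; consult the labels element of the output for the actual mapping.

```r
# Toy data: 6 records, of which only three candidate pairs are kept.
kept_pairs <- rbind(c(1, 3), c(3, 6), c(4, 6))

# Records that appear in at least one kept pair survive the reduction;
# records 2 and 5 have no potential matches and are dropped.
original <- sort(unique(as.vector(kept_pairs)))

# Hypothetical dictionary mirroring the described layout:
# column 1 = original label, column 2 = new label.
label_map <- cbind(original = original, new = seq_along(original))

# Original -> new: relabel the kept pairs.
relabeled_pairs <- matrix(match(kept_pairs, label_map[, "original"]),
                          ncol = 2)

# New -> original: recover the original labels from the new ones.
recovered_pairs <- matrix(label_map[as.vector(relabeled_pairs), "original"],
                          ncol = 2)
```

In this toy case, records 1, 3, 4, and 6 receive the new labels 1 through 4, and recovered_pairs reproduces kept_pairs.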
If indexing is used, comparisons for record pairs that aren't potential matches are still used during inference, where they're used to inform the distribution of comparisons for non-matches.
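To make this concrete, the sketch below tallies disagreement-level indicators separately for kept and excluded pairs in base R. It is a simplification under stated assumptions: the column names are made up, and the tallies ignore the per-file-pair breakdown that the ab and ab_not_included outputs are described as having.

```r
# Toy comparison matrix: 4 record pairs, one field with 3 disagreement levels.
# Column names are hypothetical.
comparisons <- rbind(
  c(TRUE,  FALSE, FALSE),
  c(FALSE, TRUE,  FALSE),
  c(FALSE, FALSE, TRUE),
  c(FALSE, FALSE, TRUE)
)
colnames(comparisons) <- c("name_DL_1", "name_DL_2", "name_DL_3")

# Index by dropping pairs that disagree at the highest level.
pairs_to_keep <- !comparisons[, "name_DL_3"]

# Counts per disagreement level among kept and excluded pairs.
counts_kept <- colSums(comparisons[pairs_to_keep, , drop = FALSE])
counts_not_included <- colSums(comparisons[!pairs_to_keep, , drop = FALSE])
```

The two tallies together account for every record pair, which is why the excluded pairs can still inform the distribution of comparisons for non-matches.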
Value
A list containing:
record_pairs |
A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels).
comparisons |
A logical matrix, where each row contains the comparisons between the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.
K |
The number of files, assumed to be of class numeric.
file_sizes |
A numeric vector of length K, indicating the size of each file. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes now represents the sizes of each file after removing such records.
duplicates |
A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates.
field_levels |
A numeric vector indicating the number of disagreement levels for each field.
file_labels |
An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.
fp_matrix |
An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp |
A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.
ab |
An integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
file_sizes_not_included |
If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes_not_included indicates, for each file, the number of such records that were removed.
ab_not_included |
For record pairs not included according to pairs_to_keep, this is an integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
labels |
If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled. labels provides a dictionary that indicates, for each of the new labels, which record in the original labeling the new label corresponds to. In particular, the first column indicates the record in the original labeling, and the second column indicates the new labeling.
pairs_to_keep |
A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs were kept as potential matches. This may not be the same as the input pairs_to_keep if cc was set to 1.
cc |
A numeric indicator of whether the connected components of the potential matches are closed under transitivity.
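The relationship between fp_matrix, rp_to_fp, and the length of ab can be sketched in base R for K = 3 files. This is only an illustration: the particular file-pair labeling, the toy file_sizes, and the toy field_levels below are assumptions, and the labeling used by the package may differ.

```r
K <- 3
n_file_pairs <- K * (K + 1) / 2  # includes within-file pairs (k, k)

# One possible symmetric labeling of file pairs (an assumption, not
# necessarily the labeling the package uses).
fp_matrix <- matrix(0L, K, K)
fp_matrix[upper.tri(fp_matrix, diag = TRUE)] <- seq_len(n_file_pairs)
fp_matrix[lower.tri(fp_matrix)] <- t(fp_matrix)[lower.tri(fp_matrix)]

# Toy files of sizes 2, 1, and 2, so records 1-2 are in file 1, and so on.
file_sizes <- c(2, 1, 2)
file_labels <- rep(seq_len(K), file_sizes)
record_pairs <- t(combn(sum(file_sizes), 2))

# File pair label of each record pair, looked up through fp_matrix.
fp_of_pair <- fp_matrix[cbind(file_labels[record_pairs[, 1]],
                              file_labels[record_pairs[, 2]])]

# Rows index file pairs, columns index record pairs.
rp_to_fp <- outer(seq_len(n_file_pairs), fp_of_pair, "==")

# With toy fields having 2 and 4 disagreement levels, comparisons would have
# 6 columns, so a vector like ab would have 6 * 6 entries.
field_levels <- c(2, 4)
ab_length <- sum(field_levels) * n_file_pairs
```

Each column of rp_to_fp contains exactly one TRUE, since every record pair belongs to exactly one file pair.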
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: 10.1080/01621459.2021.2013242
Examples
# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)