reduce_comparison_data {multilink}    R Documentation
Reduce Comparison Data Size
Description
Use indexing to reduce the number of record pairs that are potential matches.
Usage
reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)
Arguments
comparison_list
The output of a call to create_comparison_data, as in the Examples.
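pairs_to_keep
A logical vector, with one element per row of comparison_list$record_pairs, indicating which record pairs are potential matches according to the indexing method.
cc
A numeric indicator of whether to close the potential matches under transitivity by taking their connected components. Defaults to 1.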
Details
When using comparison-based record linkage methods, scalability is a concern,
as the number of record pairs is quadratic in the number of records. In
order to address these concerns, it's common to declare certain record pairs
to not be potential matches a priori, using indexing methods. The user is
free to index using any method they like, as long as they can produce a
logical vector that indicates which record pairs are potential matches
according to their indexing method. If the user-chosen indexing method does
not output potential matches that are transitive, we recommend setting the
cc argument to 1. By transitive we mean that, for any three records i, j,
and k, if i and j are potential matches, and j and k are potential matches,
then i and k are potential matches. Non-transitive indexing schemes can lead
to poor mixing of the Gibbs sampler used for posterior inference, and suggest
that the indexing method used may have been too stringent.
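The transitivity condition can be checked on a toy example in base R. The sketch below is only an illustration under stated assumptions: the records, the kept pairs, and the names kept_pairs, component, and closed_pairs are hypothetical, and this is not the code multilink runs when cc = 1. It labels connected components by propagating the smallest label along kept pairs, then lists every pair that lies within one component.

```r
# Hypothetical indexing result over 5 records: each row is a pair (i, j),
# with i < j, that the indexing method declared a potential match.
kept_pairs <- rbind(c(1, 2), c(2, 3), c(4, 5))
n_records <- 5

# Label connected components: repeatedly push the smaller label across each
# kept pair until no label changes.
component <- seq_len(n_records)
repeat {
  previous <- component
  for (r in seq_len(nrow(kept_pairs))) {
    i <- kept_pairs[r, 1]
    j <- kept_pairs[r, 2]
    smallest <- min(component[i], component[j])
    component[i] <- smallest
    component[j] <- smallest
  }
  if (identical(component, previous)) break
}

# Closing under transitivity keeps every pair whose records share a component.
all_pairs <- t(combn(n_records, 2))
in_same_component <- component[all_pairs[, 1]] == component[all_pairs[, 2]]
closed_pairs <- all_pairs[in_same_component, , drop = FALSE]

# The original set is transitive only if closing it adds no new pairs.
is_transitive <- nrow(closed_pairs) == nrow(kept_pairs)
```

Here records 1 and 3 are not paired directly but are linked through record 2, so the closure adds the pair (1, 3) and is_transitive is FALSE.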
If indexing is used, it may be the case that some records are declared to not
be potential matches to any other records. In this case, the indexing method
has made the decision that these records have no matches, and thus we can
remove them from the data set and relabel the remaining records; see the
documentation for labels for information on how to go between the
original labeling and the new labeling.
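The relabeling idea can be sketched in base R. This is a simplified illustration and not the package's implementation: the pairs and record count are hypothetical, and the two-column dictionary only mimics the layout described for labels under Value (original label first, new label second).

```r
# Hypothetical kept pairs in the original labeling of 6 records.
# Records 2, 4, and 5 appear in no pair, so they are dropped.
kept_pairs <- rbind(c(1, 3), c(3, 6))

# Records that remain, and a dictionary from original to new labels.
retained <- sort(unique(as.vector(kept_pairs)))
labels <- cbind(original = retained, new = seq_along(retained))

# Express the kept pairs in the new labeling.
relabeled_pairs <- matrix(match(kept_pairs, labels[, "original"]), ncol = 2)

# Map back to the original labeling using the dictionary.
recovered_pairs <- matrix(labels[as.vector(relabeled_pairs), "original"],
                          ncol = 2)
```

Looking up a new label in the second column and reading off the first column recovers the original record, which is how results obtained on the reduced data can be reported in terms of the original records.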
If indexing is used, comparisons for record pairs that aren't potential matches are still used during inference, where they're used to inform the distribution of comparisons for non-matches.
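As a rough illustration of how excluded pairs can still contribute information, the base R sketch below tallies disagreement levels separately for kept and excluded pairs. The comparison matrix, the field name, and the variable names are hypothetical, and the sketch ignores the split by file pair that the ab and ab_not_included outputs are described as using.

```r
# Hypothetical indicators for 4 record pairs on one field with 3 disagreement
# levels; each row has exactly one TRUE.
comparisons <- rbind(
  c(TRUE,  FALSE, FALSE),
  c(FALSE, TRUE,  FALSE),
  c(FALSE, FALSE, TRUE),
  c(FALSE, FALSE, TRUE)
)
colnames(comparisons) <- c("name_DL_1", "name_DL_2", "name_DL_3")

# Index by dropping pairs that disagree at the highest level.
pairs_to_keep <- !comparisons[, "name_DL_3"]

# Counts per disagreement level among kept pairs and among excluded pairs.
counts_kept <- colSums(comparisons[pairs_to_keep, , drop = FALSE])
counts_not_included <- colSums(comparisons[!pairs_to_keep, , drop = FALSE])
```

The counts for excluded pairs summarize their comparisons without keeping the individual rows, which is the kind of summary that can inform the distribution of comparisons for non-matches.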
Value
A list containing:
record_pairs
A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels).
comparisons
A logical matrix, where each row contains the comparisons between the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.
K
The number of files, assumed to be of class numeric.
file_sizes
A numeric vector of length K, indicating the size of each file. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes now represents the sizes of each file after removing such records.
duplicates
A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates.
field_levels
A numeric vector indicating the number of disagreement levels for each field.
file_labels
An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.
fp_matrix
An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp
A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.
ab
An integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
file_sizes_not_included
If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes_not_included indicates, for each file, the number of such records that were removed.
ab_not_included
For record pairs not included according to pairs_to_keep, this is an integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
labels
If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled. labels provides a dictionary that indicates, for each of the new labels, which record in the original labeling the new label corresponds to. In particular, the first column indicates the record in the original labeling, and the second column indicates the new labeling.
pairs_to_keep
A logical vector, with one element per row of comparison_list$record_pairs, indicating which record pairs were kept as potential matches. This may not be the same as the input pairs_to_keep if cc was set to 1.
cc
A numeric indicator of whether the connected components of the potential matches are closed under transitivity.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: 10.1080/01621459.2021.2013242
Examples
# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- !comparison_list$comparisons[, "gname_DL_3"] &
  !comparison_list$comparisons[, "fname_DL_3"]
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)