create_comparison_data {multilink}    R Documentation
Create Comparison Data
Description
Create comparison data for all pairs of records, except for those records in files which are assumed to have no duplicates.
Usage
create_comparison_data(
records,
types,
breaks,
file_sizes,
duplicates,
verbose = TRUE
)
Arguments
records |
A |
types |
A |
breaks |
A |
file_sizes |
A |
duplicates |
A |
verbose |
A |
Details
The purpose of this function is to construct comparison vectors for each pair
of records. In order to construct these vectors, one needs to specify the
types and breaks arguments. The types argument specifies how each field should
be compared, and the breaks argument specifies how to discretize these
comparisons.
Currently, the types argument supports three types of field comparisons:
binary, absolute difference, and the normalized Levenshtein distance. Please
contact the package maintainer if you need a new type of comparison to be
supported.
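
To make the three comparison types concrete, here is a minimal base-R sketch. The helper names (compare_binary, compare_numeric, compare_string) are hypothetical and are not part of multilink, and the normalization of the Levenshtein distance by the length of the longer string is an assumption; the package may compute these quantities differently.

```r
# Hypothetical helpers illustrating the three comparison types.
# They are not multilink functions.

# Binary comparison: 0 if the two values agree, 1 if they disagree
compare_binary <- function(a, b) as.numeric(a != b)

# Absolute difference for numeric fields, with values in [0, Inf)
compare_numeric <- function(a, b) abs(a - b)

# Normalized Levenshtein distance for string fields, with values in [0, 1].
# Assumption: the edit distance is divided by the length of the longer string.
compare_string <- function(a, b) {
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b,
              USE.NAMES = FALSE)
  longest <- pmax(nchar(a), nchar(b))
  ifelse(longest == 0, 0, d / longest)
}

compare_binary(c("F", "M"), c("F", "F"))
compare_numeric(c(1980, 1975), c(1982, 1975))
compare_string(c("smith", "jones"), c("smyth", "jones"))
```

Each helper returns one raw comparison per record pair; these raw values are what the breaks argument later discretizes.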
The breaks argument should be a list, with one element for each field. If a
field is being compared with a binary comparison, i.e. types[f] = "bi", then
the corresponding element of breaks should be NA, i.e. breaks[[f]] = NA. If a
field is being compared with a numeric or string comparison, then the
corresponding element of breaks should be a vector of cut points used to
discretize the comparisons. To give more detail, suppose you pass in cut
points breaks[[f]] = c(cut_1, ..., cut_L). These cut points discretize the
range of the comparisons into L+1 intervals:
I_0=(-\infty, cut_1], I_1=(cut_1, cut_2], ..., I_L=(cut_L, \infty).
The raw comparisons, which lie in [0,\infty) for numeric comparisons and [0,1]
for string comparisons, are then replaced with indicators of which interval
the comparisons lie in. The interval I_0 corresponds to the lowest level of
disagreement for a comparison, while the interval I_L corresponds to the
highest level of disagreement for a comparison.
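
The discretization described above can be sketched in base R. This is only an illustration of how cut points map raw comparisons to interval indicators, not the package's internal code; the variable names (breaks_f, raw, level, indicators) are made up for the example.

```r
# Cut points cut_1, ..., cut_L for one string field, so L = 3
breaks_f <- c(0, 0.25, 0.5)

# Hypothetical raw normalized Levenshtein comparisons in [0, 1]
raw <- c(0, 0.1, 0.25, 0.3, 0.9)

# Index l of the interval I_l containing each raw comparison.
# left.open = TRUE makes the intervals right-closed: (cut_l, cut_{l+1}]
level <- findInterval(raw, breaks_f, left.open = TRUE)

# One TRUE/FALSE indicator column per disagreement level (L + 1 columns)
indicators <- outer(level, 0:length(breaks_f), "==")
colnames(indicators) <- paste0("I_", 0:length(breaks_f))
indicators
```

Each row of indicators has exactly one TRUE entry. With these cut points, an exact match (raw comparison 0) falls in I_0, and any comparison above 0.5 falls in I_3.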
Value
a list containing:
record_pairs
A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i.

comparisons
A logical matrix, where each row contains the comparisons for the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.

K
The number of files, assumed to be of class numeric.

file_sizes
A numeric vector of length K, indicating the size of each file.

duplicates
A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.

field_levels
A numeric vector indicating the number of disagreement levels for each field.

file_labels
An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.

fp_matrix
An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].

rp_to_fp
A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.

ab
An integer vector, of length ncol(comparisons) * K * (K + 1) / 2 that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.

file_sizes_not_included
A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.

ab_not_included
A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.

labels
NA. This element is not NA when reduce_comparison_data is used.

pairs_to_keep
NA. This element is not NA when reduce_comparison_data is used.

cc
0. This element is non-zero when reduce_comparison_data is used.
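
As a rough illustration of how file_sizes, duplicates, file_labels, and record_pairs relate, the base-R sketch below enumerates record pairs for a tiny made-up example. It is not the package's implementation. It assumes that records are stored file by file (file 1 first, then file 2, and so on), and that the only pairs left out are those between two records of the same file when that file is assumed to have no duplicates.

```r
# Hypothetical input: file 1 has 2 records, file 2 has 3 records
file_sizes <- c(2, 3)
# File 1 is assumed duplicate-free; file 2 may contain duplicates
duplicates <- c(0, 1)

# file_labels[i] is the file that record i belongs to
# (assumes records are stacked file by file)
file_labels <- rep(seq_along(file_sizes), times = file_sizes)

# All pairs (i, j) with i < j, sorted by the first column and then the second
all_pairs <- t(combn(sum(file_sizes), 2))

# Drop pairs of records from the same file when that file has no duplicates
first_file <- file_labels[all_pairs[, 1]]
second_file <- file_labels[all_pairs[, 2]]
keep <- !(first_file == second_file & duplicates[first_file] == 0)

record_pairs <- as.data.frame(all_pairs[keep, , drop = FALSE])
record_pairs
```

Here the pair (1, 2) is dropped because both records come from file 1, which is assumed to have no duplicates, leaving 9 of the 10 possible pairs.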
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: 10.1080/01621459.2021.2013242
Examples
library(multilink)

## Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

## Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))