initialize_partition {multilink}    R Documentation
Initialize the Partition
Description
Generate an initialization for the partition for the case in which no file is assumed to contain duplicates (so that the partition is a matching).
Usage
initialize_partition(comparison_list, pairs_to_keep, seed = NA)
Arguments
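comparison_list | The comparison data for the records, e.g. the output from a call to create_comparison_data as in the Examples.
pairs_to_keep | A logical vector indicating which record pairs are potential matches in the initialization (see Details).
seed | The seed to use to generate the initialization.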
Details
When it is assumed that no file contains duplicates, and
reduce_comparison_data is not used to reduce the number of potential
matches, the Gibbs sampler used for posterior inference may mix slowly when
the partition is initialized with each record in its own cluster (the
default option for the Gibbs sampler). The purpose of this function is to
provide an alternative initialization scheme.
To use this initialization scheme, the user passes in a logical vector that
indicates which record pairs are potential matches according to an indexing
method (as in reduce_comparison_data). Note that this indexing is used only
to generate the initialization; it is not used for inference. The
initialization scheme first finds the transitive closure of the potential
matches, which partitions the records into blocks. Within each block, the
scheme randomly selects one record from each file, and these selected
records are placed in the same cluster for the partition initialization.
All other records are placed in their own clusters.
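The scheme described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the toy inputs file_sizes and kept_pairs, and the helper find_root, are hypothetical and do not correspond to the structure of comparison_list.

```r
# Hypothetical toy inputs (not multilink data structures):
# 3 files with 2 records each; records are numbered 1..6 in file order.
file_sizes <- c(2, 2, 2)
n <- sum(file_sizes)
file_of <- rep(seq_along(file_sizes), times = file_sizes)

# Record pairs kept as potential matches by some indexing rule (one per row).
kept_pairs <- rbind(c(1, 3), c(3, 5), c(2, 3))

# Step 1: transitive closure of the kept pairs via a simple union-find.
parent <- seq_len(n)
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}
for (k in seq_len(nrow(kept_pairs))) {
  a <- find_root(kept_pairs[k, 1])
  b <- find_root(kept_pairs[k, 2])
  if (a != b) parent[b] <- a
}
block <- sapply(seq_len(n), find_root)  # block identifier of each record

# Step 2: in each block, pick one record per file at random and cluster them;
# every other record becomes its own cluster.
set.seed(42)
Z <- rep(NA_integer_, n)
next_label <- 1L
for (b in unique(block)) {
  members <- which(block == b)
  chosen <- vapply(split(members, file_of[members]),
                   function(idx) idx[sample.int(length(idx), 1)],
                   integer(1))
  Z[chosen] <- next_label
  next_label <- next_label + 1L
  for (r in setdiff(members, chosen)) {
    Z[r] <- next_label
    next_label <- next_label + 1L
  }
}
Z
```

In this toy input, records 1, 2, 3 and 5 form one block; because records 1 and 2 come from the same file, only one of them can join records 3 and 5 in the shared cluster, so no cluster ever contains two records from the same file.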
Value
An integer vector of arbitrary labels of length
sum(comparison_list$file_sizes), giving an initialization for the partition.
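Because the labels are arbitrary, only the grouping they induce matters. The following base R sketch shows one way to read such a vector; the label vectors and the helper same_partition are hypothetical and are not output from, or part of, multilink.

```r
# Hypothetical label vector for 6 records (not produced by multilink).
Z <- c(1L, 2L, 1L, 3L, 1L, 4L)

# Records grouped by cluster label.
clusters <- split(seq_along(Z), Z)

# A relabelled vector that encodes the same partition.
Z_relabelled <- c(7L, 5L, 7L, 9L, 7L, 2L)

# Two label vectors agree if they co-cluster exactly the same record pairs.
same_partition <- function(a, b) {
  all(outer(a, a, "==") == outer(b, b, "=="))
}
same_partition(Z, Z_relabelled)
```

Comparing co-clustering matrices is quadratic in the number of records, so this check is only meant for small illustrations.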
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: 10.1080/01621459.2021.2013242
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))
# Find initialization for the matching
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)