relabel_bayes_estimate {multilink} | R Documentation
Relabel the Bayes Estimate of a Partition
Description
Relabel the Bayes estimate of a partition, for use after using indexing to reduce the number of record pairs that are potential matches.
Usage
relabel_bayes_estimate(reduced_comparison_list, bayes_estimate)
Arguments
reduced_comparison_list
    The output from a call to reduce_comparison_data.
bayes_estimate
    The output from a call to find_bayes_estimate.
Details
When the function reduce_comparison_data is used to reduce the number of record pairs that are potential matches, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels in reduce_comparison_data for information on how to go between the original labeling and the new labeling. The purpose of this function is to relabel the output of find_bayes_estimate when the function reduce_comparison_data is used, so that the user doesn't have to do this relabeling themselves.
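The relabeling idea described above can be sketched in base R on toy data. This is only an illustration, not the package's implementation: the names n_original, kept_original_labels and reduced_link_id are hypothetical, and the sketch assumes that records dropped by indexing are treated as singleton clusters that each receive a fresh id.

```r
# Hypothetical toy inputs (not produced by multilink):
# 6 original records; indexing kept the records in rows 1, 2, 4 and 5.
n_original <- 6
kept_original_labels <- c(1, 2, 4, 5)  # original row of each kept record
reduced_link_id <- c(1, 1, 2, 3)       # cluster id of each kept record

# Start with every original record unlabeled
relabeled <- data.frame(original_labels = seq_len(n_original),
                        link_id = NA_real_)

# Copy the cluster ids of the kept records back to their original rows
relabeled$link_id[kept_original_labels] <- reduced_link_id

# Assumption: each record dropped by indexing becomes its own cluster,
# so it receives an id that no kept record uses
dropped <- is.na(relabeled$link_id)
relabeled$link_id[dropped] <- max(reduced_link_id) + seq_len(sum(dropped))

# link_id is now 1, 1, 4, 2, 3, 5
print(relabeled)
```

In practice relabel_bayes_estimate performs this step for you, so a sketch like this is only useful for understanding what the relabeled output represents.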
Value
A data.frame, with as many rows as sum(reduced_comparison_list$file_sizes + reduced_comparison_list$file_sizes_not_included), i.e. the number of records originally input to create_comparison_data, before indexing occurred. This data.frame has two columns, "original_labels" and "link_id". Given row i of records originally input to create_comparison_data, the linkage id according to bayes_estimate is given by the i-th row of the link_id column. See the documentation for find_bayes_estimate for information on how to interpret this linkage id.
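As a rough illustration of working with a two-column result of this shape, the base R sketch below groups original row numbers by link_id. The data frame is hand-made toy data, not package output, and the sketch assumes that records sharing a link_id are estimated to refer to the same entity; any special values described in the find_bayes_estimate documentation would need separate handling.

```r
# Toy stand-in for the returned data.frame (values are made up)
relabeled <- data.frame(original_labels = 1:6,
                        link_id = c(1, 1, 4, 2, 3, 5))

# Group the original row numbers by their linkage id
clusters <- split(relabeled$original_labels, relabeled$link_id)

# Number of records in each cluster
cluster_sizes <- lengths(clusters)

# Keep only clusters with more than one record, i.e. estimated links
linked <- clusters[cluster_sizes > 1]
print(linked)
```

Because row i of the result corresponds to row i of the original records, the link_id column can also be bound directly onto the records with cbind, as the Examples section does.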
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: 10.1080/01621459.2021.2013242
Examples
# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)
# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
  n_prior_pars = NA)
# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)
# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)
# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)
# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  full_estimate)
partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  partial_estimate)
# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
  full_estimate_id = full_estimate_relabel$link_id,
  partial_estimate_id = partial_estimate_relabel$link_id)