find_bayes_estimate {multilink} | R Documentation |
Find the Bayes Estimate of a Partition
Description
Find the (approximate) Bayes estimate of a partition based on MCMC samples of the partition and a specified loss function.
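As general background, a Bayes estimate minimizes the posterior expected loss, and with MCMC output that expectation is commonly approximated by an average over the retained samples. The notation below is assumed for illustration and does not come from the package: Z^{(s)} is the s-th retained posterior sample of the partition, S is the number of samples kept after burn-in, and L is the loss determined by L_FNM, L_FM1, L_FM2 and L_A.

```latex
\hat{Z}_{\text{Bayes}}
  = \arg\min_{\hat{Z}} \, \mathbb{E}\!\left[ L(Z, \hat{Z}) \mid \text{data} \right]
  \approx \arg\min_{\hat{Z}} \, \frac{1}{S} \sum_{s=1}^{S} L\!\left(Z^{(s)}, \hat{Z}\right)
```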
Usage
find_bayes_estimate(
  partitions,
  burn_in,
  L_FNM = 1,
  L_FM1 = 1,
  L_FM2 = 2,
  L_A = Inf,
  max_cc_size = nrow(partitions),
  verbose = TRUE
)
Arguments
partitions |
Posterior samples of the partition, where each column
is one sample and the partition is represented as an |
burn_in |
The number of samples to discard for burn in. |
L_FNM |
Positive loss for a false non-match. Default is 1. |
L_FM1 |
Positive loss for a type 1 false match. Default is 1. |
L_FM2 |
Positive loss for a type 2 false match. Default is 2. |
L_A |
Positive loss for abstaining from making a decision for a record.
Default is Inf. |
max_cc_size |
The maximum allowable connected component size over which
the posterior expected loss is minimized. Default is nrow(partitions). |
verbose |
A logical value. Default is TRUE. |
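As a minimal sketch of what burn_in means for a matrix laid out as described above (one posterior sample per column), the snippet below drops the first burn_in columns of a toy matrix. The matrix toy_partitions and its labels are made up for illustration, and whether find_bayes_estimate performs the discard in exactly this way internally is an assumption.

```r
# Toy posterior samples: 4 records (rows) by 6 MCMC iterations (columns).
# The labels are invented for illustration only.
toy_partitions <- matrix(
  c(1, 2, 3, 4,
    1, 1, 3, 4,
    1, 1, 3, 4,
    1, 1, 3, 3,
    1, 1, 3, 4,
    1, 1, 3, 4),
  nrow = 4, ncol = 6  # matrix() fills column by column
)
burn_in <- 2

# Keep every record (row) but drop the first `burn_in` samples (columns).
keep_cols <- seq_len(ncol(toy_partitions)) > burn_in
kept <- toy_partitions[, keep_cols, drop = FALSE]

dim(kept)  # 4 rows, 4 remaining samples
```

Using drop = FALSE keeps the result a matrix even if only one sample remains after the discard.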
Value
A vector of the same length as a column of partitions, containing the (approximate) Bayes estimate of the partition. If !is.infinite(L_A), the output may be a partial estimate. A positive number l in index i indicates that record i is in the same cluster as every other record j with l in index j. A value of -1 in index i indicates that the Bayes estimate abstained from making a decision for record i.
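The encoding described above can be unpacked with base R. The sketch below uses a hypothetical output vector, estimate, written by hand rather than produced by a model fit; it lists the records on which the estimate abstained and groups the remaining record indices by cluster label.

```r
# Hypothetical output for six records; not produced by an actual model fit.
estimate <- c(1, 1, 2, -1, 3, 2)

# Records for which the estimate abstained are coded as -1.
abstained <- which(estimate == -1)

# Group the remaining record indices by their (positive) cluster label.
decided <- which(estimate != -1)
clusters <- split(decided, estimate[decided])

abstained          # record 4
clusters           # label 1: records 1, 2; label 2: records 3, 6; label 3: record 5
lengths(clusters)  # cluster sizes: 2, 2, 1
```

A cluster of size one corresponds to a record that the estimate does not link to any other record.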
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: 10.1080/01621459.2021.2013242
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))
# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)
# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)
# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
  Z_init = Z_init, seed = 42)
# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)
# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)
# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)
# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
  n_prior_pars = NA)
# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)
# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)
# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)