gibbs_sampler {multilink} | R Documentation |
Gibbs Sampler for Posterior Inference
Description
Run a Gibbs sampler to explore the posterior distribution of partitions of records.
Usage
gibbs_sampler(
comparison_list,
prior_list,
n_iter = 2000,
Z_init = 1:sum(comparison_list$file_sizes),
seed = 70,
single_likelihood = FALSE,
chaperones_info = NA,
verbose = TRUE
)
Arguments
comparison_list |
The output from a call to
|
prior_list |
The output from a call to |
n_iter |
The number of iterations of the Gibbs sampler to run. |
Z_init |
Initialization of the partition of records, represented as an
|
seed |
The seed to use while running the Gibbs sampler. |
single_likelihood |
A |
chaperones_info |
If |
verbose |
A |
Details
Given the prior specified using specify_prior
, this function
runs a Gibbs sampler to explore the posterior distribution of partitions of
records, conditional on the comparison data created using
create_comparison_data
or reduce_comparison_data
.
Value
a list containing:
m
Posterior samples of the
m
parameters. Each column is one sample.u
Posterior samples of the
u
parameters. Each column is one sample.partitions
Posterior samples of the partition. Each column is one sample. Note that the partition is represented as an
integer
vector of arbitrary labels of lengthsum(comparison_list$file_sizes)
.contingency_tables
Posterior samples of the overlap table. Each column is one sample. This incorporates counts of records determined not to be candidate matches to any other records using
reduce_comparison_data
.cluster_sizes
Posterior samples of the size of each cluster (associated with an arbitrary label from
1
tosum(comparison_list$file_sizes)
). Each column is one sample.sampling_time
The time in seconds it took to run the sampler.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242][arXiv]
Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, & Rebecca C. Steorts (2015). Microclustering: When the cluster sizes grow sublinearly with the size of the data set. NeurIPS Bayesian Nonparametrics: The Next Generation Workshop Series. [arXiv]
Brenda Betancourt, Giacomo Zanella, Jeffrey Miller, Hanna Wallach, Abbas Zaidi, & Rebecca C. Steorts (2016). Flexible Models for Microclustering with Application to Entity Resolution. Advances in neural information processing systems. [Published] [arXiv]
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = no_dup_data_small$file_sizes,
duplicates = c(0, 0, 0))
# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
dup_count_prior_family = NA, dup_count_prior_pars = NA,
n_prior_family = "uniform", n_prior_pars = NA)
# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)
# Run the Gibbs sampler
{
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
Z_init = Z_init, seed = 42)
}
# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = dup_data_small$file_sizes,
duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
pairs_to_keep, cc = 1)
# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
n_prior_pars = NA)
# Run the Gibbs sampler
{
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
seed = 42)
}