| specify_prior {multilink} | R Documentation |
Specify the Prior Distributions
Description
Specify the prior distributions for the m and u parameters of the
models for comparison data among matches and non-matches, and the partition.
Usage
specify_prior(
comparison_list,
mus = NA,
nus = NA,
flat = 0,
alphas = NA,
dup_upper_bound = NA,
dup_count_prior_family = NA,
dup_count_prior_pars = NA,
n_prior_family = NA,
n_prior_pars = NA
)
Arguments
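comparison_list |
The output from a call to create_comparison_data or reduce_comparison_data. |
mus, nus |
The hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively. If set to NA, flat priors are used. |
flat |
A numeric indicator of whether a flat prior for partitions should be used: 1 for a flat prior, 0 for a structured prior. |
alphas |
The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector. If set to NA, all hyperparameters are set to 1. |
dup_upper_bound |
A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. |
dup_count_prior_family |
A character vector naming, for each file, the prior family for the number of duplicates in each cluster. Currently "Poisson" is supported. |
dup_count_prior_pars |
A list containing, for each file, the parameters of the prior for the number of duplicates in each cluster. |
n_prior_family |
A character string naming the prior family for n, the number of clusters represented in the records. |
n_prior_pars |
Currently set to NA. |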
Details
The purpose of this function is to specify prior distributions for all
parameters of the model. Please note that if
reduce_comparison_data is used to reduce the number of
record pairs that are potential matches, then the output of
reduce_comparison_data (not
create_comparison_data) should be used as input.
For the hyperparameters of the Dirichlet priors for the m
and u parameters for the comparisons among matches and non-matches,
respectively, we recommend using a flat prior. This is accomplished by
setting mus=NA and nus=NA. Informative prior specifications
are possible, but in practice they will be overwhelmed by the large number of
comparisons.
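To illustrate why an informative prior tends to be overwhelmed, recall that a Dirichlet prior is conjugate to multinomial counts: the posterior hyperparameters are the prior hyperparameters plus the observed counts. The base R sketch below is a simplification that treats the match status of each pair as known, which the model does not. The counts and the informative hyperparameters are made-up illustrative values, not output from the package.

```r
# Hypothetical counts of agreement levels for one comparison field
# over a large number of record pairs (illustrative values only).
level_counts <- c(5000, 3000, 2000)

# A flat Dirichlet prior and a deliberately informative one.
flat_prior <- c(1, 1, 1)
informative_prior <- c(50, 1, 1)

# Conjugate update: posterior hyperparameters = prior + counts.
# The posterior mean is the normalized posterior hyperparameter vector.
posterior_mean <- function(prior, counts) {
  posterior <- prior + counts
  posterior / sum(posterior)
}

flat_mean <- posterior_mean(flat_prior, level_counts)
informative_mean <- posterior_mean(informative_prior, level_counts)

# Largest difference between the two posterior means
max_abs_diff <- max(abs(flat_mean - informative_mean))

print(round(rbind(flat = flat_mean, informative = informative_mean), 4))
print(max_abs_diff)
```

With ten thousand comparisons, the two posterior means differ by well under one percentage point, even though the informative prior puts fifty times more weight on the first level.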
For the prior for partitions, we do not recommend using a flat prior. Instead
we recommend using our structured prior for partitions. By setting
flat=0 and the remaining arguments to NA, one obtains the
default specification for the structured prior that we have found to perform
well in simulation studies. The structured prior for partitions is specified
as follows:
Specify a prior for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently, a uniform prior and a scale prior for n are supported. Our default specification uses a uniform prior.
Specify a prior for the overlap table (see the documentation for alphas for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to 1.
For each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through dup_upper_bound. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to 1. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean 1.
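The examples on this page pass alphas = rep(1, 7) for three files. That length is consistent with one hyperparameter per non-empty combination of files in which a cluster can have records. The base R sketch below enumerates those combinations for K = 3. The row ordering shown here is an assumption made for illustration and is not necessarily the ordering the package uses.

```r
K <- 3  # number of files, as in the examples on this page

# All 0/1 inclusion patterns across K files
# (1 = the cluster has at least one record in that file).
patterns <- as.matrix(expand.grid(rep(list(0:1), K)))
colnames(patterns) <- paste0("file_", seq_len(K))

# Drop the all-zero pattern: a cluster must contain at least one record.
nonempty_patterns <- patterns[rowSums(patterns) > 0, , drop = FALSE]

# One hyperparameter per non-empty pattern, all set to 1.
alphas <- rep(1, nrow(nonempty_patterns))

print(nonempty_patterns)
print(length(alphas))
```

For K files this gives 2^K - 1 hyperparameters, which is 7 when K = 3.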
Please contact the package maintainer if you need new prior families
for n or the number of duplicates in each cluster to be supported.
Value
a list containing:
mus |
The hyperparameters of the Dirichlet priors for the m parameters for the comparisons among matches. |
nus |
The hyperparameters of the Dirichlet priors for the u parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data. |
flat |
A numeric indicator of whether a flat prior for partitions should be used. flat is 1 if a flat prior is used, and flat is 0 if a structured prior is used. |
no_dups |
A numeric indicator of whether no duplicates are allowed in all of the files. |
alphas |
The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K, where the first element is 0. |
alpha_0 |
The sum of alphas. |
dup_upper_bound |
A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file. |
log_dup_count_prior |
A list containing the log density of the prior distribution for the number of duplicates in each cluster, for each file. |
log_n_prior |
A numeric vector containing the log density of the prior distribution for the number of clusters represented in the records. |
nus_specified |
The nus before data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data are added. Used for input checking. |
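The constraint on dup_upper_bound can be checked before calling specify_prior. The helper below is hypothetical (it is not part of multilink) and the file sizes are made-up values. It only encodes the rule stated above, that each bound lies between 1 and the size of the corresponding file.

```r
# Hypothetical helper, not part of multilink: returns TRUE when every
# per-file bound is between 1 and the number of records in that file.
check_dup_upper_bound <- function(dup_upper_bound, file_sizes) {
  stopifnot(length(dup_upper_bound) == length(file_sizes))
  all(dup_upper_bound >= 1 & dup_upper_bound <= file_sizes)
}

file_sizes <- c(150, 150, 150)  # illustrative file sizes

ok_bounds  <- check_dup_upper_bound(c(10, 10, 10), file_sizes)
no_bound   <- check_dup_upper_bound(file_sizes, file_sizes)
bad_bounds <- check_dup_upper_bound(c(0, 10, 200), file_sizes)

print(c(ok = ok_bounds, no_bound = no_bound, bad = bad_bounds))
```

Passing the file sizes themselves corresponds to imposing no bound beyond the number of records in each file.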
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: 10.1080/01621459.2021.2013242
Examples
# Load the package that provides specify_prior and the example data
library(multilink)

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior: no duplicates, so each file's upper bound is 1
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior, using the reduced comparison data as input
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
  n_prior_pars = NA)