specify_prior {multilink} | R Documentation |
Specify the Prior Distributions
Description
Specify the prior distributions for the m and u parameters of the models for comparison data among matches and non-matches, and the partition.
Usage
specify_prior(
comparison_list,
mus = NA,
nus = NA,
flat = 0,
alphas = NA,
dup_upper_bound = NA,
dup_count_prior_family = NA,
dup_count_prior_pars = NA,
n_prior_family = NA,
n_prior_pars = NA
)
Arguments
comparison_list |
The output from a call to create_comparison_data or reduce_comparison_data. |
mus, nus |
The hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively. |
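flat |
A numeric indicator of whether a flat prior for partitions should be used. flat should be 1 if a flat prior is used, and 0 if a structured prior is used. |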
alphas |
The hyperparameters for the Dirichlet-multinomial overlap table
prior, a positive |
dup_upper_bound |
A |
dup_count_prior_family |
A |
dup_count_prior_pars |
A |
n_prior_family |
A |
n_prior_pars |
Currently set to |
Details
The purpose of this function is to specify prior distributions for all parameters of the model. Please note that if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used as input.
For the hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively, we recommend using a flat prior. This is accomplished by setting mus=NA and nus=NA. Informative prior specifications are possible, but in practice they will be overwhelmed by the large number of comparisons.
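To see why an informative prior has little effect here, recall that a Dirichlet prior is conjugate to multinomial counts, so the posterior hyperparameters are the prior hyperparameters plus the observed counts. The sketch below is not part of multilink: the counts and hyperparameter values are made up for illustration, and it assumes that a flat prior corresponds to all hyperparameters being equal to 1.

```r
# Illustrative only: hypothetical agreement-level counts for one field,
# as might arise from a large number of record-pair comparisons.
counts <- c(9000, 800, 200)

# Two candidate Dirichlet priors over the three agreement levels.
flat_prior <- c(1, 1, 1)          # assumed form of a flat prior
informative_prior <- c(20, 5, 1)  # hypothetical informative prior

# Conjugate update: posterior hyperparameters = prior + counts.
posterior_mean <- function(prior, counts) {
  post <- prior + counts
  post / sum(post)
}

flat_mean <- posterior_mean(flat_prior, counts)
informative_mean <- posterior_mean(informative_prior, counts)

# Largest difference between the two posterior means across levels.
max_diff <- max(abs(flat_mean - informative_mean))
print(round(rbind(flat = flat_mean, informative = informative_mean), 4))
```

With counts in the thousands, the two posterior means agree to roughly three decimal places, which is the sense in which the data overwhelm the prior.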
For the prior for partitions, we do not recommend using a flat prior. Instead we recommend using our structured prior for partitions. By setting flat=0 and the remaining arguments to NA, one obtains the default specification for the structured prior that we have found to perform well in simulation studies. The structured prior for partitions is specified as follows:
- Specify a prior for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently, a uniform prior and a scale prior for n are supported. Our default specification uses a uniform prior.
- Specify a prior for the overlap table (see the documentation for alphas for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to 1.
- For each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through dup_upper_bound. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to 1. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean 1.
Please contact the package maintainer if you need new prior families for n or the number of duplicates in each cluster to be supported.
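As a rough illustration of the size of the overlap table, the sketch below enumerates the ways a cluster can include records from each of K files. It uses base R only and is not a multilink function. For K = 3 it yields 7 non-empty patterns, which is consistent with the alphas = rep(1, 7) used in the examples on this page. The ordering of the patterns here is an assumption and may not match the ordering the package uses internally.

```r
# Enumerate file-inclusion patterns for K files (K = 3 matches the examples).
K <- 3

# Each row is one pattern: 1 if a cluster has records from that file, else 0.
patterns <- expand.grid(rep(list(0:1), K))
names(patterns) <- paste0("file_", seq_len(K))

# Drop the all-zero pattern: a cluster must contain at least one record.
nonempty <- patterns[rowSums(patterns) > 0, ]

# One hyperparameter per non-empty pattern, all set to 1 as in the default.
alphas <- rep(1, nrow(nonempty))
print(nonempty)
```

The full grid has 2 ^ K rows, and removing the empty pattern leaves 2 ^ K - 1 hyperparameters to specify.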
Value
a list containing:
mus
The hyperparameters of the Dirichlet priors for the m parameters for the comparisons among matches.
nus
The hyperparameters of the Dirichlet priors for the u parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data.
flat
A numeric indicator of whether a flat prior for partitions should be used. flat is 1 if a flat prior is used, and flat is 0 if a structured prior is used.
no_dups
A numeric indicator of whether no duplicates are allowed in all of the files.
alphas
The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K, where the first element is 0.
alpha_0
The sum of alphas.
dup_upper_bound
A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file.
log_dup_count_prior
A list containing the log density of the prior distribution for the number of duplicates in each cluster, for each file.
log_n_prior
A numeric vector containing the log density of the prior distribution for the number of clusters represented in the records.
nus_specified
The nus before data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data are added. Used for input checking.
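To give a feel for what a per-file log density bounded by dup_upper_bound could look like, here is a minimal sketch of a truncated Poisson log density in base R. The helper log_truncated_poisson is hypothetical and is not a multilink function. It assumes that a cluster holding r records from a file has r - 1 duplicates and that the prior is renormalized over 1 to the upper bound; the package's internal parametrization may differ.

```r
# Hypothetical helper, not a multilink function: log density of a Poisson
# prior on the number of duplicates, truncated at an upper bound.
log_truncated_poisson <- function(upper_bound, lambda = 1) {
  # Assumption: a cluster holding r records from a file has r - 1 duplicates.
  n_records <- seq_len(upper_bound)
  log_unnorm <- dpois(n_records - 1, lambda = lambda, log = TRUE)
  # Renormalize so the probabilities over 1..upper_bound sum to one.
  log_unnorm - log(sum(exp(log_unnorm)))
}

# One log-density vector per file, mirroring dup_upper_bound = c(10, 10, 10).
dup_upper_bound <- c(10, 10, 10)
log_prior_by_file <- lapply(dup_upper_bound, log_truncated_poisson, lambda = 1)

# With an upper bound of 1 (no duplicates), all mass sits on a single record.
no_dup_log_prior <- log_truncated_poisson(1)
```

Under these assumptions, an upper bound of 1 gives a degenerate prior with log density 0 at a single record, which matches the no-duplicates setting described in Details.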
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: 10.1080/01621459.2021.2013242
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))
# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)
# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
  n_prior_pars = NA)