assess_reference_mc {rubias} | R Documentation |
Partition a reference dataset and estimate reporting group and collection proportions
Description
From a reference dataset, this draws (without replacement) a simulated mixture dataset with randomly drawn population proportions, then uses this in two different estimates of population mixture proportions: maximum likelihood via EM-algorithm and posterior mean from MCMC.
Usage
assess_reference_mc(
reference,
gen_start_col,
reps = 50,
mixsize = 100,
seed = 5,
alpha_repunit = 1.5,
alpha_collection = 1.5,
min_remaining = 5,
alle_freq_prior = list(const_scaled = 1)
)
Arguments
reference |
a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has some "reference" entries. |
gen_start_col |
the first column of genetic data in reference |
reps |
number of reps to do |
mixsize |
the number of individuals in each simulated mixture. |
seed |
a random seed for simulations |
alpha_repunit |
The dirichlet parameter for simulating the proportions of reporting units. Default = 1.5 |
alpha_collection |
The dirichlet parameter for simulating proportions of collections within reporting units. Default = 1.5 |
min_remaining |
the minimum number of individuals which should be conserved in each reference collection during sampling without replacement to form the simulated mixture |
alle_freq_prior |
a one-element named list specifying the prior to be used when
generating Dirichlet parameters for genotype likelihood calculations. Valid methods include
|
Details
This method is referred to as "Monte Carlo cross-validation".
The input parameters for assess_reference_mc
are more restrictive than those of
assess_reference_loo
. Rather than allowing a data.frame to specify Dirichlet
parameters, proportions, or counts for specific reporting units and collections,
assess_reference_mc
only allows vector input (default = 1.5) for alpha_repunit
and alpha_collection
. These inputs specify the uniform Dirichlet parameters for
all reporting units and collections, respectively.
For mixture proportion generation, the rho values are first drawn using a stick-breaking
model of the Dirichlet distribution, but with proportions capped by min_remaining
.
Stick-breaking is then used to subdivide each reporting unit into collections. In
addition to the constraint that mixture sampling without replacement cannot deplete
the number of individuals in each collection below min_remaining
, a similar
constraint is placed upon the number of individuals left in reporting units,
determined as min_remaining
* (# collections in reporting unit).
Note that this implies that the data are only truly Dirichlet distributed when no
rejections based on min_remaining
occur. This is a reasonable certainty with
sufficient reference collection sizes relative to the desired mixture size.
Examples
# only 5 reps, so it doesn't take too long. Typically you would
# do many more
ale_dev <- assess_reference_mc(alewife, 17, 5)