infer_data_and_decision_model {baldur} | R Documentation |
Function to sample the posterior of the Bayesian data and
decision model. It first produces the needed inputs for Stan's sampling()
for each peptide (or protein, PTM, etc.). It then runs the sampling for the
data and decision model. From the posterior, it then collects estimates and
sampling statistics from the posterior of data model and integrates the
decision distribution D. It then returns a tibble()
with all the
information for each peptide's posterior (see Value below). There are
major time gains to be made by running this procedure in parallel.
infer_data_and_decision_model
has an efficient wrapper around
multidplyr
. This will let you to evenly distribute all peptides evenly to
each worker. E.g., two workers will each run half of the peptides in
parallel.
infer_data_and_decision_model(
data,
id_col,
design_matrix,
contrast_matrix,
uncertainty_matrix,
stan_model = empirical_bayes,
clusters = 1,
h_not = 0,
...
)
data |
A |
id_col |
A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.) |
design_matrix |
A design matrix for the data. For the empirical_bayes prior only mean models are allowed (see example). For the weakly_informative prior more complicated design can be used. |
contrast_matrix |
A contrast matrix of the decisions to test. Columns
should sum to |
uncertainty_matrix |
A matrix of observation specific uncertainties |
stan_model |
Which Bayesian model to use. Defaults to empirical_bayes but also allows weakly_informative, or an user supplied function see []. |
clusters |
The number of parallel threads/workers to run on. |
h_not |
The value of the null hypothesis for the difference in means |
... |
Additional arguments passed to
|
The data model of Baldur assumes that the observations of a peptide,
\boldsymbol{Y}
, is a normally distributed with a standard deviation,
\sigma
, common to all measurements. In addition, it assumes that each
measurement has a unique uncertainty u
. It then models all
measurements in the same condition with a common mean \mu
. It then
assumes that the common variation of the peptide is caused by the variation
in the \mu
As such, it models \mu
with the common variance
\sigma
and a non-centered parametrization for condition level
effects.
\boldsymbol{Y}\sim\mathcal{N}(\boldsymbol{X}\boldsymbol{\mu},\sigma\boldsymbol{u})\quad
\boldsymbol{\mu}\sim\mathcal{N}(\mu_0+\boldsymbol{\eta}\sigma,\sigma)
It then assumes \sigma
to be gamma distributed with hyperparameters
infered from either a gamma regression fit_gamma_regression or a latent
gamma mixture regression fit_lgmr.
\sigma\sim\Gamma(\alpha,\beta)
For details on the two priors for \mu_0
see empirical_bayes or
weakly_informative. Baldur then builds a posterior distribution of the
difference(s) in means for contrasts of interest. In addition, Baldur
increases the precision of the decision as the number of measurements
increase. This is done by weighting the sample size with the contrast
matrix. To this end, Baldur limits the possible contrast of interest such
that each column must sum to zero, and the absolute value of each column
must sum to two. That is, only mean comparisons are allowed.
\boldsymbol{D}\sim\mathcal{N}(\boldsymbol{\mu}^\text{T}\boldsymbol{K},\sigma\boldsymbol{\xi}),\quad \xi_{i}=\sqrt{\sum_{c=1}^{C}|k_{cm}|n_c^{-1}}
where \boldsymbol{K}
is a contrast matrix of interest and
k_{cm}
is the contrast of the c:th condition in the m:th contrast of
interest, and n_c
is the number of measurements in the c:th
condition. Baldur then integrates the tails of \boldsymbol{D}
to
determine the probability of error.
P(\text{\textbf{error}})=2\Phi(-\left|\boldsymbol{\mu}_{\boldsymbol{D}}-H_0\right|\odot\boldsymbol{\tau}_{\boldsymbol{D}})
where H_0
is the null hypothesis for the difference in means,
\Phi
is the CDF of the standard normal,
\boldsymbol{\mu}_{\boldsymbol{D}}
is the posterior mean of
\boldsymbol{D}
, \boldsymbol{\tau}_{\boldsymbol{D}}
is the
posterior precision of \boldsymbol{D}
, and \odot
is the
Hadamard product.
A tibble()
or data.frame()
annotated with statistics of the
posterior and p(error). err
is the probability of error, i.e., the two
tail-density supporting the null-hypothesis, lfc
is the estimated
\log_2
-fold change, sigma
is the common variance, and lp
is the
log-posterior. Columns without suffix shows the mean estimate from the
posterior, while the suffixes _025
, _50
, and _975
, are the 2.5, 50.0,
and 97.5, percentiles, respectively. The suffixes _eff
and _rhat
are
the diagnostic variables returned by Stan
. In general, a larger _eff
indicates effective sample size and _rhat
indicates the mixing within
chains and between the chains and should be smaller than 1.05 (please see
the Stan manual for more details). In addition it returns a column
warnings
with potential warnings given by the sampler.
# (Please see the vignette for a tutorial)
# Setup model matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
yeast_norm <- yeast %>%
# Remove missing data
tidyr::drop_na() %>%
# Normalize data
psrn('identifier') %>%
# Add mean-variance trends
calculate_mean_sd_trends(design)
# Fit the gamma regression
gam <- fit_gamma_regression(yeast_norm, sd ~ mean)
# Estimate each data point's uncertainty
unc <- estimate_uncertainty(gam, yeast_norm, 'identifier', design)
yeast_norm <- gam %>%
# Add hyper-priors for sigma
estimate_gamma_hyperparameters(yeast_norm)
# Setup contrast matrix
contrast <- matrix(c(-1, 1), 2)
yeast_norm %>%
head() %>% # Just running a few for the example
infer_data_and_decision_model(
'identifier',
design,
contrast,
unc,
clusters = 1 # I highly recommend increasing the number of parallel workers/clusters
# this will greatly reduce the speed of running baldur
)