infer_data_and_decision_model {baldur}R Documentation

Sample the Posterior of the data and decision model and generate point estimates



Function to sample the posterior of the Bayesian data and decision model. It first produces the needed inputs for Stan's sampling() for each peptide (or protein, PTM, etc.). It then runs the sampling for the data and decision model. From the posterior, it then collects estimates and sampling statistics from the posterior of data model and integrates the decision distribution D. It then returns a tibble() with all the information for each peptide's posterior (see Value below). There are major time gains to be made by running this procedure in parallel. infer_data_and_decision_model has an efficient wrapper around multidplyr. This will let you to evenly distribute all peptides evenly to each worker. E.g., two workers will each run half of the peptides in parallel.


  stan_model = empirical_bayes,
  clusters = 1,
  h_not = 0,



A tibble or data.frame with alpha,beta priors annotated


A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.)


A design matrix for the data. For the empirical_bayes prior only mean models are allowed (see example). For the weakly_informative prior more complicated design can be used.


A contrast matrix of the decisions to test. Columns should sum to 0 and only mean comparisons are allowed. That is, the absolute value of the positive and negative values in each column has to sum to 2. E.g., a column can be [0.5, 0.5, -1⁠]⁠^T but not [1, 1, -1⁠]⁠^T or [0.5, 0.5, -2⁠]⁠^T. That is, sum(abs(x))=2 where x is a column in the contrast matrix.


A matrix of observation specific uncertainties


Which Bayesian model to use. Defaults to empirical_bayes but also allows weakly_informative, or an user supplied function see [].


The number of parallel threads/workers to run on.


The value of the null hypothesis for the difference in means


Additional arguments passed to rstan::sampling. Note that verbose will always be forced to FALSE to avoid console flooding.


The data model of Baldur assumes that the observations of a peptide, \boldsymbol{Y}, is a normally distributed with a standard deviation, \sigma, common to all measurements. In addition, it assumes that each measurement has a unique uncertainty u. It then models all measurements in the same condition with a common mean \mu. It then assumes that the common variation of the peptide is caused by the variation in the \mu As such, it models \mu with the common variance \sigma and a non-centered parametrization for condition level effects.

\boldsymbol{Y}\sim\mathcal{N}(\boldsymbol{X}\boldsymbol{\mu},\sigma\boldsymbol{u})\quad \boldsymbol{\mu}\sim\mathcal{N}(\mu_0+\boldsymbol{\eta}\sigma,\sigma)

It then assumes \sigma to be gamma distributed with hyperparameters infered from either a gamma regression fit_gamma_regression or a latent gamma mixture regression fit_lgmr.


For details on the two priors for \mu_0 see empirical_bayes or weakly_informative. Baldur then builds a posterior distribution of the difference(s) in means for contrasts of interest. In addition, Baldur increases the precision of the decision as the number of measurements increase. This is done by weighting the sample size with the contrast matrix. To this end, Baldur limits the possible contrast of interest such that each column must sum to zero, and the absolute value of each column must sum to two. That is, only mean comparisons are allowed.

\boldsymbol{D}\sim\mathcal{N}(\boldsymbol{\mu}^\text{T}\boldsymbol{K},\sigma\boldsymbol{\xi}),\quad \xi_{i}=\sqrt{\sum_{c=1}^{C}|k_{cm}|n_c^{-1}}

where \boldsymbol{K} is a contrast matrix of interest and k_{cm} is the contrast of the c:th condition in the m:th contrast of interest, and n_c is the number of measurements in the c:th condition. Baldur then integrates the tails of \boldsymbol{D} to determine the probability of error.


where H_0 is the null hypothesis for the difference in means, \Phi is the CDF of the standard normal, \boldsymbol{\mu}_{\boldsymbol{D}} is the posterior mean of \boldsymbol{D}, \boldsymbol{\tau}_{\boldsymbol{D}} is the posterior precision of \boldsymbol{D}, and \odot is the Hadamard product.


A tibble() or data.frame() annotated with statistics of the posterior and p(error). err is the probability of error, i.e., the two tail-density supporting the null-hypothesis, lfc is the estimated \log_2-fold change, sigma is the common variance, and lp is the log-posterior. Columns without suffix shows the mean estimate from the posterior, while the suffixes ⁠_025⁠, ⁠_50⁠, and ⁠_975⁠, are the 2.5, 50.0, and 97.5, percentiles, respectively. The suffixes ⁠_eff⁠ and ⁠_rhat⁠ are the diagnostic variables returned by Stan. In general, a larger ⁠_eff⁠ indicates effective sample size and ⁠_rhat⁠ indicates the mixing within chains and between the chains and should be smaller than 1.05 (please see the Stan manual for more details). In addition it returns a column warnings with potential warnings given by the sampler.


# (Please see the vignette for a tutorial)
# Setup model matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))

yeast_norm <- yeast %>%
# Remove missing data
  tidyr::drop_na() %>%
  # Normalize data
  psrn('identifier') %>%
  # Add mean-variance trends

# Fit the gamma regression
gam <- fit_gamma_regression(yeast_norm, sd ~ mean)

# Estimate each data point's uncertainty
unc <- estimate_uncertainty(gam, yeast_norm, 'identifier', design)

yeast_norm <- gam %>%
   # Add hyper-priors for sigma
# Setup contrast matrix
contrast <- matrix(c(-1, 1), 2)

yeast_norm %>%
  head() %>% # Just running a few for the example
    clusters = 1 # I highly recommend increasing the number of parallel workers/clusters
                 # this will greatly reduce the speed of running baldur

[Package baldur version 0.0.3 Index]