avg_and_regularize {bakR}    R Documentation

Efficiently average replicates of nucleotide recoding data and regularize

Description

avg_and_regularize pools and regularizes replicate estimates of kinetic parameters. There are two key steps in this downstream analysis.

First, the uncertainty for each feature is used to fit a linear ln(uncertainty) vs. log10(read depth) trend, and the uncertainties for individual features are shrunk towards the regression line. The uncertainty for each feature combines the Fisher Information asymptotic uncertainty with the variability seen between replicate estimates. Regularization of the uncertainty estimates uses the analytic results for a Normal likelihood with known mean and unknown variance and a conjugate prior. The known mean is taken to be the average kdeg across replicates, weighted by the number of reads mapping to the feature in each replicate. The prior parameters are estimated from the regression and the amount of variability about the regression line. The strength of regularization can be tuned by adjusting the prior_weight parameter, with larger values yielding stronger shrinkage towards the regression line.

Second, the average kdeg estimates are regularized. This uses the analytic results for a Normal likelihood with unknown mean and known variance and a conjugate prior. The known variance is taken to be the one obtained from regularizing the uncertainty estimates in the first step. The prior parameters are estimated from the population-wide kdeg distribution (its mean and standard deviation are used as the mean and standard deviation of the Normal prior).
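
The regularization of the average estimate can be sketched in base R. This is only an illustration of the textbook Normal-Normal conjugate update, not bakR's internal code: the helper name shrink_mean and every input value are hypothetical, and applying the read-weighted average on the log(kdeg) scale is an assumption made for the example.

```r
# Hypothetical helper: posterior mean and sd of a Normal mean with known
# variance and a conjugate Normal prior (not bakR's internal code).
shrink_mean <- function(est, se, prior_mean, prior_sd) {
  prec_data  <- 1 / se^2        # precision of the feature's own estimate
  prec_prior <- 1 / prior_sd^2  # precision of the population-wide prior
  post_var   <- 1 / (prec_data + prec_prior)
  post_mean  <- post_var * (prec_data * est + prec_prior * prior_mean)
  list(mean = post_mean, sd = sqrt(post_var))
}

# Made-up log(kdeg) estimates for one feature in three replicates,
# with made-up read counts used as averaging weights.
log_kd <- c(-2.1, -1.8, -2.4)
reads  <- c(100, 300, 200)
avg_log_kd <- weighted.mean(log_kd, w = reads)

# Made-up regularized uncertainty and population-wide prior parameters.
post <- shrink_mean(avg_log_kd, se = 0.5, prior_mean = -1.5, prior_sd = 0.5)
```

Because the estimate and the prior are given equal precision here, the posterior mean falls halfway between them; a smaller se would pull it towards the feature's own estimate.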

Usage

avg_and_regularize(
  Mut_data_est,
  nreps,
  sample_lookup,
  feature_lookup,
  nbin = NULL,
  NSS = FALSE,
  Chase = FALSE,
  BDA_model = FALSE,
  null_cutoff = 0,
  Mutrates = NULL,
  ztest = FALSE
)

Arguments

Mut_data_est

Dataframe with fraction new estimation information. Required columns are:

  • fnum; numerical ID of feature

  • reps; numerical ID of replicate

  • mut; numerical ID of experimental condition (Exp_ID)

  • logit_fn_rep; logit(fn) estimate

  • kd_rep_est; kdeg estimate

  • log_kd_rep_est; log(kdeg) estimate

  • logit_fn_se; logit(fn) estimate uncertainty

  • log_kd_se; log(kdeg) estimate uncertainty
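
A toy data frame with the columns listed above might be assembled as follows. All values are invented, the label time tl is hypothetical, and the pulse-label relation kdeg = -ln(1 - fn)/tl used to fill kd_rep_est is an assumption made for illustration; the real function may expect further columns not listed here.

```r
# Toy input with the required columns; all values are invented.
fn <- c(0.30, 0.35, 0.50, 0.55)   # made-up fraction new estimates
tl <- 2                           # hypothetical label time
kd <- -log(1 - fn) / tl           # assumed pulse-label relation

Mut_data_est <- data.frame(
  fnum = c(1, 1, 1, 1),           # one feature
  reps = c(1, 2, 1, 2),           # two replicates per condition
  mut  = c(1, 1, 2, 2),           # two experimental conditions
  logit_fn_rep   = qlogis(fn),    # qlogis() is the logit function
  kd_rep_est     = kd,
  log_kd_rep_est = log(kd),
  logit_fn_se    = c(0.10, 0.12, 0.09, 0.11),
  log_kd_se      = c(0.08, 0.10, 0.07, 0.09)
)

# Check that every documented column is present.
required <- c("fnum", "reps", "mut", "logit_fn_rep", "kd_rep_est",
              "log_kd_rep_est", "logit_fn_se", "log_kd_se")
missing_cols <- setdiff(required, names(Mut_data_est))
```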

nreps

Vector of number of replicates in each experimental condition

sample_lookup

Dictionary mapping sample names to various experimental details

feature_lookup

Dictionary mapping feature IDs to original feature names

nbin

Number of bins for mean-variance relationship estimation. If NULL, the larger of 10 and (number of logit(fn) estimates)/100 is used.

NSS

Logical; if TRUE, logit(fn)s are compared rather than log(kdeg) so as to avoid steady-state assumption.

Chase

Logical; set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl, where fn is the fraction of reads that are s4U (more properly referred to as the fraction old in the context of a pulse-chase experiment).
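
The pulse-chase relation quoted above can be written as a one-line function. The helper name chase_kdeg and the input values are hypothetical; the sketch only restates the formula and is not bakR code.

```r
# Pulse-chase relation from the text: kdeg = -ln(fn) / tl.
# chase_kdeg is a hypothetical helper name; inputs are made up.
chase_kdeg <- function(fn, tl) {
  stopifnot(all(fn > 0 & fn < 1), tl > 0)  # fn must be a proper fraction
  -log(fn) / tl
}

kd <- chase_kdeg(fn = c(0.5, 0.25), tl = 2)
```

Halving fn at a fixed tl adds ln(2)/tl to the estimated kdeg, so the second value is exactly twice the first in this example.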

BDA_model

Logical; if TRUE, variance is regularized with a scaled inverse chi-squared model. Otherwise, a log-normal model is used.
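
For orientation, the standard conjugate update for a variance with known mean under a scaled inverse chi-squared prior looks like this in base R. The helper name shrink_var, the prior degrees of freedom nu0, the prior scale sigma0_sq, and the data are all hypothetical; how bakR derives its prior parameters from the regression trend is not reproduced here.

```r
# Hypothetical helper: posterior parameters for a variance with known mean
# under a scaled inverse chi-squared prior (standard conjugate result).
shrink_var <- function(y, mu, nu0, sigma0_sq) {
  n <- length(y)
  v <- mean((y - mu)^2)          # mean squared deviation about the known mean
  nu_n <- nu0 + n                # posterior degrees of freedom
  list(df = nu_n, scale_sq = (nu0 * sigma0_sq + n * v) / nu_n)
}

# Made-up replicate estimates and a made-up known mean.
y  <- c(-2.1, -1.8, -2.4)
mu <- -2.1

# A larger nu0 acts like a heavier prior weight and pulls the posterior
# scale closer to the prior scale sigma0_sq.
weak   <- shrink_var(y, mu, nu0 = 1,  sigma0_sq = 0.01)
strong <- shrink_var(y, mu, nu0 = 20, sigma0_sq = 0.01)
```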

null_cutoff

bakR will test the null hypothesis of |effect size| < |null_cutoff|

Mutrates

List containing new and old mutation rate estimates

ztest

Logical; if TRUE, a z-test is used for p-value calculation rather than the more conservative moderated t-test.

Details

Effect sizes (changes in kdeg) are obtained as the difference in log(kdeg) means between the reference and experimental sample(s). The log(kdeg)s are assumed to be independent, so the variance of the effect size is the sum of the log(kdeg) variances. P-values assessing the significance of the effect size are obtained using a moderated t-test, with the number of degrees of freedom determined from the uncertainty regression hyperparameters, and are adjusted for multiple testing using the Benjamini-Hochberg procedure to control false discovery rates (FDRs).
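
The arithmetic in this paragraph can be sketched with base R functions. The means, standard errors, and degrees of freedom below are invented, the sign convention (experimental minus reference) is an assumption, and a plain t distribution with a fixed df stands in for bakR's moderated t-test, whose degrees of freedom come from the regression hyperparameters.

```r
# Made-up pooled log(kdeg) means and standard errors for three features
# in a reference and an experimental condition.
ref_mean <- c(-2.0, -1.5, -3.0)
exp_mean <- c(-1.2, -1.5, -2.9)
ref_se   <- c(0.15, 0.20, 0.25)
exp_se   <- c(0.20, 0.20, 0.25)

effect    <- exp_mean - ref_mean          # difference in log(kdeg) means
effect_se <- sqrt(ref_se^2 + exp_se^2)    # independence: variances add
stat      <- effect / effect_se

df    <- 6                                # hypothetical degrees of freedom
p_t   <- 2 * pt(-abs(stat), df = df)      # two-sided t-based p-value
p_z   <- 2 * pnorm(-abs(stat))            # two-sided z-test p-value
p_adj <- p.adjust(p_t, method = "BH")     # Benjamini-Hochberg adjustment
```

The z-test p-values are never larger than the t-based ones, which is why the t-based test is the more conservative choice.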

In some cases, the assumed ODE model of RNA metabolism will not accurately model the dynamics of the biological system being analyzed. In these cases, it is best to compare logit(fraction new)s directly rather than converting fraction new to log(kdeg). This analysis strategy is implemented when NSS is set to TRUE. Comparing logit(fraction new) is only valid if a single metabolic label time has been used for all samples. For example, if a label time of 1 hour was used for NR-seq data from WT cells and a 2-hour label time was used in KO cells, this comparison is no longer valid, as differences in logit(fraction new) could stem from differences in kinetics or in label times.
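
The label-time confound can be shown with a small calculation. It assumes the simple pulse-label relation fn = 1 - exp(-kdeg * tl), which is used here for illustration only, and a made-up kdeg that is identical in both samples.

```r
# Same made-up degradation rate in both samples, different label times.
kdeg  <- 0.3
fn_1h <- 1 - exp(-kdeg * 1)   # assumed fraction new after a 1-hour label
fn_2h <- 1 - exp(-kdeg * 2)   # assumed fraction new after a 2-hour label

# A nonzero difference arises even though the kinetics are identical.
logit_diff <- qlogis(fn_2h) - qlogis(fn_1h)
```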

Value

List with dataframes providing information about replicate-specific and pooled analysis results. The output includes:


[Package bakR version 1.0.0 Index]