fast_analysis {bakR} | R Documentation |

`fast_analysis`

analyzes nucleotide recoding data with maximum likelihood estimation
implemented by `stats::optim`

combined with analytic solutions to simple Bayesian models to perform
approximate partial pooling. Output includes kinetic parameter estimates in each replicate, kinetic parameter estimates
averaged across replicates, and log-2 fold changes in the degradation rate constant (L2FC(kdeg)).
Averaging takes into account uncertainties estimated using the Fisher Information and estimates
are regularized using analytic solutions of fully Bayesian models. The result is that kdegs are
shrunk towards population means and that uncertainties are shrunk towards a mean-variance trend estimated as part of the analysis.

```
fast_analysis(
df,
pnew = NULL,
pold = NULL,
no_ctl = FALSE,
read_cut = 50,
features_cut = 50,
nbin = NULL,
prior_weight = 2,
MLE = TRUE,
ztest = FALSE,
lower = -7,
upper = 7,
se_max = 2.5,
mut_reg = 0.1,
p_mean = 0,
p_sd = 1,
StanRate = FALSE,
Stan_data = NULL,
null_cutoff = 0,
NSS = FALSE,
Chase = FALSE,
BDA_model = FALSE,
multi_pold = FALSE,
Long = FALSE,
kmeans = FALSE,
Fisher = TRUE
)
```

`df` |
Dataframe in form provided by cB_to_Fast |

`pnew` |
Labeled read mutation rate; default of 0 means that model estimates rate from s4U fed data. If pnew is provided by user, must be a vector of length == number of s4U fed samples. The 1st element corresponds to the s4U induced mutation rate estimate for the 1st replicate of the 1st experimental condition; the 2nd element corresponds to the s4U induced mutation rate estimate for the 2nd replicate of the 1st experimental condition, etc. |

`pold` |
Unlabeled read mutation rate; default of 0 means that model estimates rate from no-s4U fed data |

`no_ctl` |
Logical; if TRUE, then -s4U control is not used for background mutation rate estimation |

`read_cut` |
Minimum number of reads for a given feature-sample combo to be used for mut rate estimates |

`features_cut` |
Number of features to estimate sample specific mutation rate with |

`nbin` |
Number of bins for mean-variance relationship estimation. If NULL, max of 10 or (number of logit(fn) estimates)/100 is used |

`prior_weight` |
Determines extent to which logit(fn) variance is regularized to the mean-variance regression line |

`MLE` |
Logical; if TRUE then replicate logit(fn) is estimated using maximum likelihood; if FALSE more conservative Bayesian hypothesis testing is used |

`ztest` |
TRUE; if TRUE, then a z-test is used for p-value calculation rather than the more conservative moderated t-test. |

`lower` |
Lower bound for MLE with L-BFGS-B algorithm |

`upper` |
Upper bound for MLE with L-BFGS-B algorithm |

`se_max` |
Uncertainty given to those transcripts with estimates at the upper or lower bound sets. This prevents downstream errors due to abnormally high standard errors due to transcripts with extreme kinetics |

`mut_reg` |
If MLE has instabilities, empirical mut rate will be used to estimate fn, multiplying pnew by 1+mut_reg and pold by 1-mut_reg to regularize fn |

`p_mean` |
Mean of normal distribution used as prior penalty in MLE of logit(fn) |

`p_sd` |
Standard deviation of normal distribution used as prior penalty in MLE of logit(fn) |

`StanRate` |
Logical; if TRUE, a simple 'Stan' model is used to estimate mutation rates for fast_analysis; this may add a couple minutes to the runtime of the analysis. |

`Stan_data` |
List; if StanRate is TRUE, then this is the data passed to the 'Stan' model to estimate mutation rates. If using the |

`null_cutoff` |
bakR will test the null hypothesis of |effect size| < |null_cutoff| |

`NSS` |
Logical; if TRUE, logit(fn)s are compared rather than log(kdeg) so as to avoid steady-state assumption. |

`Chase` |
Logical; Set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl where fn is the fraction of reads that are s4U (more properly referred to as the fraction old in the context of a pulse-chase experiment) |

`BDA_model` |
Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal model is used. |

`multi_pold` |
Logical; if TRUE, pold is estimated for each sample rather than use a global pold estimate. |

`Long` |
Logical; if TRUE, long read optimized fraction new estimation strategy is used. |

`kmeans` |
Logical; if TRUE, kmeans clustering on read-specific mutation rates is used to estimate pnews and pold. |

`Fisher` |
Logical; if TRUE, Fisher information is used to estimate logit(fn) uncertainty. Else, a less conservative binomial model is used, which can be preferable in instances where the Fisher information strategy often drastically overestimates uncertainty (i.e., low coverage or low pnew). |

Unless the user supplies estimates for pnew and pold, the first step of `fast_analysis`

is to estimate the background (pold)
and metabolic label (will refer to as s4U for simplicity, though bakR is compatible with other metabolic labels such as s6G)
induced (pnew) mutation rates. Several pnew and pold estimation strategies are implemented in bakR. For pnew estimation,
the two strategies are likelihood maximization of a binomial mixture model (default) and sampling from the full posterior of
a U-content adjusted Poisson mixture model with HMC (when `StanRateEst`

is set to TRUE in `bakRFit`

).

The default pnew estimation strategy involves combining the mutational data for all features into sample-wide mutational data vectors (# of T-to-C conversions, # of Ts, and # of such reads vectors). These data vectors are then downsampled (to prevent float overflow) and used to maximize the likelihood of a two-component binomial mixture model. The two components correspond to reads from old and new RNA, and the three estimated paramters are the fraction of all reads that are new (nuisance parameter in this case), and the old and new read mutation rates.

The alternative strategy involves running a fully Bayesian implementation of a similar mixture model using Stan, a
probalistic programming language that bakR makes use of in several functions. This strategy can yield more accurate
mutation rate estimates when the label times are much shorter or longer than the average half-lives of the sequenced RNA
(i.e., the fraction news are mostly close to 0 or 1, respectively). To improve the efficiency of this approach, only a small
subset of RNA features are analyzed, the number of which is set by the `RateEst_size`

parameter in `bakRFit`

. By default,
this number is set to 30, which on the average NR-seq dataset yields a several minute runtime for mutation rate estimation.

Estimation of pold can be performed with three strategies: the same two strategies discussed for pnew estimation, and a third
strategy that relies on the presence of -s4U control data. If -s4U control data is present, the default pold estimation
strategy is to use the average mutation rate in reads from all -s4U control datasets as the global pold estimate. Thus,
a single pold estimate is used for all samples. The likelihood maximization strategy can be used by
setting `no_ctl`

to TRUE, and this strategy becomes the default strategy if no -s4U data is present.
In addition, as of version 1.0.0 of bakR (released late June of 2023), users can decide to estimate
pold for each +s4U sample independently by setting `multi_pold`

to TRUE. In this case, independent -s4U datasets can no longer be used for
mutation rate estimation purposes, and thus the strategies for pold estimation are identical to the
set of pnew estimation strategies.

Once mutation rates are estimated, fraction news for each feature in each sample are estimated. The approach utilized is MLE
using the L-BFGS-B algorithm implemented in `stats::optim`

. The assumed likelihood function is derived from a Poisson mixture
model with rates adjusted according to each feature's empirical U-content (the average number of Us present in sequencing reads mapping
to that feature in a particular sample). Fraction new estimates are then converted to degradation rate constant estimates using
a solution to a simple ordinary differential equation model of RNA metabolism.

Once fraction new and kdegs are estimated, the uncertainty in these parameters is estimated using the Fisher Information. In the limit of
large datasets, the variance of the MLE is inversely proportional to the Fisher Information evaluated at the MLE. Mixture models are
typically singular, meaning that the Fisher information matrix is not positive definite and asymptotic results for the variance
do not necessarily hold. As the mutation rates are estimated a priori and fixed to be > 0, these problems are eliminated. In addition, when assessing
the uncertainty of replicate fraction new estimates, the size of the dataset is the raw number of sequencing reads that map to a
particular feature. This number is often large (>100) which increases the validity of invoking asymptotics.
As of version 1.0.0, users can opt for an alternative uncertainty estimation strategy by setting
`Fisher`

to FALSE. This strategy makes use of the standard error for the estimator of a binomial random
variables rate of success parameter. If we can uniquely identify new and old reads, then the
variance in our estimate for the fraction of reads that is new is fn*(1-fn)/n. This uncertainty
estimate will typically underestimate fraction new replicate uncertainties. We showed in the bakR
paper though that the Fisher information strategy often significantly overestimates uncertainties
of low coverage or extreme fraction new features. Therefore, this more bullish, underconservative
uncertainty quantification can be useful on datasets with low mutation rates, extreme label times,
or low sequecning depth. We have found that false discovery rates are still well controlled when using
this alternative uncertainty quantification strategy.

With kdegs and their uncertainties estimated, replicate estimates are pooled and regularized. There are two key steps in this
downstream analysis. 1st, the uncertainty for each feature is used to fit a linear ln(uncertainty) vs. log10(read depth) trend,
and uncertainties for individual features are shrunk towards the regression line. The uncertainty for each feature is a combination of the
Fisher Information asymptotic uncertainty as well as the amount of variability seen between estimates. Regularization of uncertainty
estimates is performed using the analytic results of a Normal distribution likelihood with known mean and unknown variance and conjugate
priors. The prior parameters are estimated from the regression and amount of variability about the regression line. The strength of
regularization can be tuned by adjusting the `prior_weight`

parameter, with larger numbers yielding stronger shrinkage towards
the regression line. The 2nd step is to regularize the average kdeg estimates. This is done using the analytic results of a
Normal distribution likelihood model with unknown mean and known variance and conjugate priors. The prior parameters are estimated from the
population wide kdeg distribution (using its mean and standard deviation as the mean and standard deviation of the normal prior).
In the 1st step, the known mean is assumed to be the average kdeg, averaged across replicates and weighted by the number of reads
mapping to the feature in each replicate. In the 2nd step, the known variance is assumed to be that obtained following regularization
of the uncertainty estimates.

Effect sizes (changes in kdeg) are obtained as the difference in log(kdeg) means between the reference and experimental sample(s), and the log(kdeg)s are assumed to be independent so that the variance of the effect size is the sum of the log(kdeg) variances. P-values assessing the significance of the effect size are obtained using a moderated t-test with number of degrees of freedom determined from the uncertainty regression hyperparameters and are adjusted for multiple testing using the Benjamini- Hochberg procedure to control false discovery rates (FDRs).

In some cases, the assumed ODE model of RNA metabolism will not accurately model the dynamics of a biological system being analyzed.
In these cases, it is best to compare logit(fraction new)s directly rather than converting fraction new to log(kdeg).
This analysis strategy is implemented when `NSS`

is set to TRUE. Comparing logit(fraction new) is only valid
If a single metabolic label time has been used for all samples. For example, if a label time of 1 hour was used for NR-seq
data from WT cells and a 2 hour label time was used in KO cells, this comparison is no longer valid as differences in
logit(fraction new) could stem from differences in kinetics or label times.

List with dataframes providing information about replicate-specific and pooled analysis results. The output includes:

Fn_Estimates; dataframe with estimates for the fraction new and fraction new uncertainty for each feature in each replicate. The columns of this dataframe are:

Feature_ID; Numerical ID of feature

Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)

Replicate; Numerical ID for replicate

logit_fn; logit(fraction new) estimate, unregularized

logit_fn_se; logit(fraction new) uncertainty, unregularized and obtained from Fisher Information

nreads; Number of reads mapping to the feature in the sample for which the estimates were obtained

log_kdeg; log of degradation rate constant (kdeg) estimate, unregularized

kdeg; degradation rate constant (kdeg) estimate

log_kd_se; log(kdeg) uncertainty, unregularized and obtained from Fisher Information

sample; Sample name

XF; Original feature name

Regularized_ests; dataframe with average fraction new and kdeg estimates, averaged across the replicates and regularized using priors informed by the entire dataset. The columns of this dataframe are:

Feature_ID; Numerical ID of feature

Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)

avg_log_kdeg; Weighted average of log(kdeg) from each replicate, weighted by sample and feature-specific read depth

sd_log_kdeg; Standard deviation of the log(kdeg) estimates

nreads; Total number of reads mapping to the feature in that condition

sdp; Prior standard deviation for fraction new estimate regularization

theta_o; Prior mean for fraction new estimate regularization

sd_post; Posterior uncertainty

log_kdeg_post; Posterior mean for log(kdeg) estimate

kdeg; exp(log_kdeg_post)

kdeg_sd; kdeg uncertainty

XF; Original feature name

Effects_df; dataframe with estimates of the effect size (change in logit(fn)) comparing each experimental condition to the reference sample for each feature. This dataframe also includes p-values obtained from a moderated t-test. The columns of this dataframe are:

Feature_ID; Numerical ID of feature

Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)

L2FC(kdeg); Log2 fold change (L2FC) kdeg estimate or change in logit(fn) if NSS TRUE

effect; LFC(kdeg)

se; Uncertainty in L2FC_kdeg

pval; P-value obtained using effect_size, se, and a z-test

padj; pval adjusted for multiple testing using Benjamini-Hochberg procedure

XF; Original feature name

Mut_rates; list of two elements. The 1st element is a dataframe of s4U induced mutation rate estimates, where the mut column represents the experimental ID and the rep column represents the replicate ID. The 2nd element is the single background mutation rate estimate used

Hyper_Parameters; vector of two elements, named a and b. These are the hyperparameters estimated from the uncertainties for each feature, and represent the two parameters of a Scaled Inverse Chi-Square distribution. Importantly, a is the number of additional degrees of freedom provided by the sharing of uncertainty information across the dataset, to be used in the moderated t-test.

Mean_Variance_lms; linear model objects obtained from the uncertainty vs. read count regression model. One model is run for each Exp_ID

```
# Simulate small dataset
sim <- Simulate_bakRData(300, nreps = 2)
# Fit fast model to get fast_df
Fit <- bakRFit(sim$bakRData)
# Fit fast model with fast_analysis
Fast_Fit <- fast_analysis(Fit$Data_lists$Fast_df)
```

[Package *bakR* version 1.0.0 Index]