bakRFit {bakR}  R Documentation 
bakRFit
analyzes nucleotide recoding RNAseq data to estimate
kinetic parameters relating to RNA stability and changes in RNA
stability induced by experimental perturbations. Several statistical
models of varying efficiency and accuracy can be used to fit data.
bakRFit(
obj,
StanFit = FALSE,
HybridFit = FALSE,
high_p = 0.2,
totcut = 50,
totcut_all = 10,
Ucut = 0.25,
AvgU = 4,
FastRerun = FALSE,
FOI = c(),
concat = TRUE,
StanRateEst = FALSE,
RateEst_size = 30,
low_reads = 100,
high_reads = 5e+05,
chains = 1,
NSS = FALSE,
Chase = FALSE,
BDA_model = FALSE,
multi_pold = FALSE,
Long = FALSE,
kmeans = FALSE,
ztest = FALSE,
Fisher = TRUE,
...
)
obj 

StanFit 
Logical; if TRUE, then the MCMC implementation is run. Will only be used if 
HybridFit 
Logical; if TRUE, then the Hybrid implementation is run. Will only be used if 
high_p 
Numeric; Any features with a mutation rate (number of mutations / number of Ts in reads) higher than this in any s4U control samples (i.e., samples that were not treated with s4U) are filtered out 
totcut 
Numeric; Any features with less than this number of sequencing reads in any replicate of all experimental conditions are filtered out 
totcut_all 
Numeric; Any features with less than this number of sequencing reads in any sample are filtered out 
Ucut 
Numeric; All features must have a fraction of reads with 2 or less Us less than this cutoff in all samples 
AvgU 
Numeric; All features must have an average number of Us greater than this cutoff in all samples 
FastRerun 
Logical; only matters if a bakRFit object is passed to 
FOI 
Features of interest; character vector containing names of features to analyze 
concat 
Logical; If TRUE, FOI is concatenated with output of reliableFeatures 
StanRateEst 
Logical; if TRUE, a simple 'Stan' model is used to estimate mutation rates for fast_analysis; this may add a couple minutes to the runtime of the analysis. 
RateEst_size 
Numeric; if StanRateEst is TRUE, then data from RateEst_size genes are used for mutation rate estimation. This can be as low as 1 and should be kept low to ensure maximum efficiency 
low_reads 
Numeric; if StanRateEst is TRUE, then only features with more than low_reads reads in all samples will be used for mutation rate estimation 
high_reads 
Numeric; if StanRateEst is TRUE, then only features with less than high_read reads in all samples will be used for mutation rate estimation. A high read count cutoff is as important as a low read count cutoff in this case because you don't want the fraction labeled of chosen features to be extreme (e.g., close to 0 or 1), and high read count features are likely low fraction new features. 
chains 
Number of Markov chains to sample from. 1 should suffice since these are validated models. Running more chains is generally preferable, but memory constraints can make this unfeasible. 
NSS 
Logical; if TRUE, logit(fn)s are directly compared to avoid assuming steadystate 
Chase 
Logical; Set to TRUE if analyzing a pulsechase experiment. If TRUE, kdeg = ln(fn)/tl where fn is the fraction of reads that are s4U (more properly referred to as the fraction old in the context of a pulsechase experiment). 
BDA_model 
Logical; if TRUE, variance is regularized with scaled inverse chisquared model. Otherwise a lognormal model is used. 
multi_pold 
Logical; if TRUE, pold is estimated for each sample rather than use a global pold estimate. 
Long 
Logical; if TRUE, long read optimized fraction new estimation strategy is used. 
kmeans 
Logical; if TRUE, kmeans clustering on readspecific mutation rates is used to estimate pnews and pold. 
ztest 
Logical; if TRUE and the MLE implementation is being used, then a ztest will be used for pvalue calculation rather than the more conservative moderated ttest. 
Fisher 
Logical; if TRUE, Fisher information is used to estimate logit(fn) uncertainty. Else, a less conservative binomial model is used, which can be preferable in instances where the Fisher information strategy often drastically overestimates uncertainty (i.e., low coverage or low pnew). 
... 
Arguments passed to either 
If bakRFit
is run on a bakRData object, cBprocess
and then fast_analysis
will always be called. The former will generate the processed
data that can be passed to the model fitting functions (fast_analysis
and TL_Stan
). The call to fast_analysis
will generate a list of dataframes
containing information regarding the fast_analysis
fit. fast_analysis
is always
called because its output is required for both Hybrid_fit
and TL_Stan
.
If bakRFit
is run on a bakRFit object, cBprocess
will not be called again,
as the output of cBprocess
will already be contained in the bakRFit object. Similarly,
fast_analysis
will not be called again unless bakRFit is rerun on the bakRData object.
or if FastRerun
is set to TRUE. If you want to generate model fits using different parameters for cBprocess,
you will have to rerun bakRFit
on the bakRData object.
If bakRFit
is run on a bakRFnData object, fn_process
and then avg_and_regularize
will always be called. The former will generate the processed data that can be passed to the model
fitting functions (avg_and_regularize
and TL_Stan
, the latter only with HybridFit = TRUE).
If bakRFit
is run on a bakRFnFit object. fn_process
will not be called again, as
the output of fn_process
will already be contained in the bakRFnFit object. Similary,
avg_and_regularize
will not be called unless bakRFit
is rerun on the bakRData object,
or if FastRerun
is set to TRUE. If you want to generate model fits using different
parameters for fn_process
, you will have to rerun bakRFit
on the bakRData object.
See the documentation for the individual fitting functions for details regarding how they analyze nucleotide recoding data. What follows is a brief overview of how each works
fast_analysis
(referred to as the MLE implementation in the bakR paper)
either estimates mutation rates from + and (if available)  s4U samples or uses mutation rate estimates
provided by the user to perform maximum likelihood estimation (MLE) of the fraction of RNA that is labeled for each
replicate of nucleotide recoding data provided. Uncertainties for each replicate's estimate are approximated using
asymptotic results involving the Fisher Information and assuming known mutation rates. Replicate data
is pooled using an approximation to hierarchical modeling that relies on analytic solutions to simple Bayesian models.
Linear regression is used to estimate the relationship between read depths and replicate variability for uncertainty
estimation regularization, again performed using analytic solutions to Bayesian models.
TL_Stan
with Hybrid_Fit set to TRUE (referred to as the Hybrid implementation in the bakR paper)
takes as input estimates of the logit(fraction new) and uncertainty provided by fast_analysis
.
It then uses 'Stan' on the backend to implement a hierarchical model that pools data across replicates and the dataset
to estimate effect sizes (L2FC(kdeg)) and uncertainties. Replicate variability information is pooled across each experimental
condition to regularize variance estimates using a hierarchical linear regression model.
The default behavior of TL_Stan
(referred to as the MCMC implementation in the bakR paper)
is to use 'Stan' on the back end to implement a Ucontent exposure adjusted Poisson mixture model
to estimate fraction news from the mutational data. Partial pooling of replicate variability estimates
is performed as with the Hybrid implementation.
bakRFit object with results from statistical modeling and data processing. Objects possibly included are:
Fast_Fit; Always will be present. Output of fast_analysis
Hybrid_Fit; Only present if HybridFit = TRUE. Output of TL_stan
Stan_Fit; Only present if StanFit = TRUE. Output of TL_stan
Data_lists; Always will be present. Output of cBprocess
with Fast and Stan == TRUE
# Simulate data for 1000 genes, 2 replicates, 2 conditions
simdata < Simulate_bakRData(1000, nreps = 2)
# You always must fit fast implementation before any others
Fit < bakRFit(simdata$bakRData)