TL_stan {bakR} | R Documentation |

`TL_stan`

is an internal function to analyze nucleotide recoding RNA-seq data with a fully
Bayesian hierarchical model implemented in the PPL `Stan`

. `TL_stan`

estimates
kinetic parameters and differences in kinetic parameters between experimental
conditions. When assessing differences, a single reference sample is compared to
each collection of experimental samples provided.

```
TL_stan(
data_list,
Hybrid_Fit = FALSE,
keep_fit = FALSE,
NSS = FALSE,
chains = 1,
...
)
```

`data_list` |
List to pass to 'Stan' of form given by |

`Hybrid_Fit` |
Logical; if TRUE, Hybrid 'Stan' model that takes as data output of |

`keep_fit` |
Logical; if TRUE, 'Stan' fit object is included in output; typically large file so default FALSE. |

`NSS` |
Logical; if TRUE, models that directly compare logit(fn)s are used to avoid steady-state assumption |

`chains` |
Number of Markov chains to sample from. The default is to only run a single chain. Typical NR-seq datasets yield very memory intensive analyses, but running a single chain should decrease this burden. For reference, running the MCMC implementation (Hybrid_Fit = FALSE) with 3 chains on an NR-seq dataset with 3 replicates of 2 experimental conditions with around 20 million raw (unmapped) reads per sample requires over 100 GB of RAM. With a single chain, this burden drops to around 20 GB. Due to memory demands and time constraints (runtimes for the MCMC implementation border will likely be around 1-2 days) means that these models should usually be run in a specialized High Performance Computing (HPC) system. |

`...` |
Arguments passed to |

Two implementations of a similar model can be fit with TL_stan: a complete nucleotide recoding RNA-seq
analysis and a hybrid analysis that takes as input results from `fast_analysis`

.
In the complete analysis (referred to in the bakR publication as the MCMC implementation),
U-to-C mutations are modeled as coming from a Poisson distribution
with rate parameter adjusted by the empirical U-content of each feature analyzed. Features
represent whatever the user defined them to be when constructing the bakR data object.
Typical feature categories are genes, exons, etc. Hierarchical modeling is used to pool data
across replicates and across features. More specifically, replicate data for the
same feature are partially pooled to estimate feature-specific mean fraction news and uncertainties.
Feature means are partially pooled to estimate dataset-wide mean fraction news and standard deviations.
The replicate variability for each feature is also partially pooled to determine a condition-wide
heteroskedastic relationship between read depths and replicate variability. Partial pooling
reduces subjectivity when determining priors by allowing the model to determine what priors make sense
given the data. Partial pooling also regularizes estimates, reducing estimate variability and thus increasing
estimate accuracy. This is particularly important for replicate variability estimates, which often rely
on only a few replicates of data per feature and thus are typically highly unstable.

The hybrid analysis (referred to in the bakR publication as the Hybrid implementation)
inherits the hierarchical modeling structure of the complete analysis, but reduces computational
burden by foregoing per-replicate-and-feature fraction new estimation and uncertainty quantification. Instead,
the hybrid analysis takes as data fraction new estimates and approximate uncertainties from `fast_analysis`

.
Runtimes of the hybrid analysis are thus often an order of magnitude shorter than with the complete analysis, but
loses some accuracy by relying on point estimates and uncertainty quantification that is only valid in the
limit of large dataset sizes (where the dataset size for the per-replicate-and-feature fraction new estimate is the raw number
of sequencing reads mapping to the feature in that replicate).

Users also have the option to save or discard the `Stan`

fit object. Fit objects can be exceedingly large (> 10 GB) for most
nucleotide recoding RNA-seq datasets. Therefore, if you don't want to store such a large object, a summary object will be saved instead,
which greatly reduces the size of the output (~ 10-50 MB) while still retaining much of the important information. In addition,
the output of `TL_stan`

provides the estimates and uncertainties for key parameters (L2FC(kdeg), kdeg, and fraction new)
that will likely be of most interest. That being said, there are some analyses that are only possible if the original fit object
is saved. For example, the fit object will contain all of the samples from the posterior collected during model fitting. Thus,
new parameters (e.g., L2FC(kdeg)'s comparing two experimental samples) not naturally generated by the model can be estimated
post-hoc. Still, there are often approximate estimates that can be obtained for such parameters that don't rely on the full
fit object. One analysis that is impossible without the original fit object is generating model diagnostic plots. These include
trace plots (to show mixing and efficient parameter space exploration of the Markov chains), pairs plots (to show correlations
between parameters and where any divergences occurred), and other visualizations that can help users assess how well a model
ran. Because the models implemented by `TL_stan`

are extensively validated, it is less likely that such diagnostics will be helpful,
but often anomalies on your data can lead to poor model convergence, in which case assessing model diagnostics can help you
identify the source of problems in your data. Summary statistics describing how well the model was able to estimate each parameter
(n_eff and rhat) are provided in the fit summaries, but can often obscure some of the nuanced details of model fitting.

A list of objects:

Effects_df; dataframe with estimates of the effect size (change in logit(fn)) comparing each experimental condition to the reference sample for each feature. This dataframe also includes p-values obtained from a moderated t-test. The columns of this dataframe are:

Feature_ID; Numerical ID of feature

Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)

L2FC_kdeg; L2FC(kdeg) posterior mean

L2FC_kd_sd; L2FC(kdeg) posterior sd

effect; identical to L2FC_kdeg (kept for symmetry with MLE fit output)

se; identical to L2FC_kd_sd (kept for symmetry with MLE fit output)

XF; Feature name

pval; p value obtained from effect and se + z-test

padj; p value adjusted for multiple testing using Benjamini-Hochberg procedure

Kdeg_df; dataframe with estimates of the kdeg (RNA degradation rate constant) for each feature, averaged across replicate data. The columns of this dataframe are:

Feature_ID; Numerical ID of feature

Exp_ID; Numerical ID for experimental condition

kdeg; Degradation rate constant posterior mean

kdeg_sd; Degradation rate constant posterior standard deviation

log_kdeg; Log of degradation rate constant posterior mean (as of version 1.0.0)

log_kdeg_sd; Log of degradation rate constant posterior standard deviation (as of version 1.0.0)

XF; Original feature name

Fn_Estimates; dataframe with estimates of the logit(fraction new) for each feature in each replicate. The columns of this dataframe are:

Feature_ID; Numerical ID for feature

Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)

Replicate; Numerical ID for replicate

logit_fn; Logit(fraction new) posterior mean

logit_fn_se; Logit(fraction new) posterior standard deviation

sample; Sample name

XF; Original feature name

Fit_Summary; only outputted if keep_fit == FALSE. Summary of 'Stan' fit object with each row corresponding to a particular parameter. All posterior point descriptions are descriptions of the marginal posterior distribution for the parameter in that row. For example, the posterior mean is the average value for the parameter when averaging over all other parameter values. The columns of this dataframe are:

mean; Posterior mean for the parameter given by the row name

se_mean; Standard error of the posterior mean; essentially how confident the model is that what it estimates to be the posterior mean is what the posterior mean actually is. This will depend on the number of chains run on the number of iterations each chain is run for.

sd; Posterior standard deviation

2.5%; 2.5th percentile of the posterior distribution. 2.5% of the posterior mass is below this point

25%; 25th percentile of the posterior distribution

50%; 50th percentile of the posterior distribution

75%; 75th percentile of the posterior distribution

97.5%; 97.5th percentile of the posterior distribution

n_eff; Effective sample size. The larger this is the better, though it should preferably be around the total number of iterations (iter x chains). Small values of this could represent poor model convergence

Rhat; Describes how well separate Markov chains mixed. This is preferably as close to 1 as possible, and values higher than 1 could represent poor model convergence

Stan_Fit; only outputted if keep_fit == TRUE. This is the full 'Stan' fit object, an R6 object of class

`stanfit`

Mutation_Rates; data frame with information about mutation rate estimates. Has the same columns as Fit_Summary. Each row corresponds to either a background mutation rate (log_lambda_o) or an s4U induced mutation rate (log_lambda_n), denoted in the parameter column. The bracketed portion of the parameter name will contain two numbers. The first corresponds to the Exp_ID and the second corresponds to the Replicate_ID. For example, if the parameter name is log_lambda_o[1,2] then that row corresponds to the background mutation rate in the second replicate of experimental condition one. A final point to mention is that the estimates are on a log(avg. # of mutations) scale. So a log_lambda_n of 1 means that on average, there are an estimated 2.72 (exp(1)) mutations in reads from new RNA (i.e., RNA synthesized during s4U labeling).

[Package *bakR* version 1.0.0 Index]