cBprocess {bakR}    R Documentation

Curate data in bakRData object for statistical modeling


Description

cBprocess creates the data structures necessary to analyze nucleotide recoding RNA-seq data with any of the statistical model implementations in bakRFit. The input to cBprocess must be an object of class bakRData.


Usage

cBprocess(
  obj,
  high_p = 0.2,
  totcut = 50,
  totcut_all = 10,
  Ucut = 0.25,
  AvgU = 4,
  Stan = TRUE,
  Fast = TRUE,
  FOI = c(),
  concat = TRUE
)



Arguments

obj
An object of class bakRData


high_p
Numeric; any transcripts with a mutation rate (number of mutations / number of Ts in reads) higher than this in any -s4U control samples are filtered out


totcut
Numeric; any transcripts with fewer than this number of sequencing reads in any replicate of every experimental condition are filtered out


totcut_all
Numeric; any transcripts with fewer than this number of sequencing reads in any sample are filtered out


Ucut
Numeric; all transcripts must have a fraction of reads with 2 or fewer Us below this cutoff in all samples


AvgU
Numeric; all transcripts must have an average number of Us greater than this cutoff in all samples


Stan
Boolean; if TRUE, then a data list that can be passed to 'Stan' is curated


Fast
Boolean; if TRUE, then a dataframe that can be passed to fast_analysis() is curated


FOI
Features of interest; character vector containing names of features to analyze. If FOI is non-null and concat is TRUE, then all minimally reliable FOIs will be combined with reliable features passing all set filters (high_p, totcut, totcut_all, Ucut, and AvgU). If concat is FALSE, only the minimally reliable FOIs will be kept. A minimally reliable FOI is one that passes filtering with minimally stringent parameters.


concat
Boolean; if TRUE, FOI is concatenated with the output of reliableFeatures
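The interplay of FOI and concat can be illustrated with a small sketch. The feature names, the minimally reliable set, and the set-operation logic here are illustrative assumptions, not the package's internal code:

```r
reliable     <- c("geneA", "geneB")                    # features passing all set filters
min_reliable <- c("geneA", "geneB", "geneC", "geneD")  # features passing minimally stringent filters
FOI          <- c("geneC", "geneX")                    # requested features of interest

# Only FOIs that are at least minimally reliable are retained
kept_FOI <- intersect(FOI, min_reliable)   # "geneC"; "geneX" is dropped

# concat = TRUE: combine with fully reliable features; FALSE: keep FOIs only
concat <- TRUE
final <- if (concat) union(reliable, kept_FOI) else kept_FOI
```

With concat = TRUE, `final` contains geneA, geneB, and geneC; with concat = FALSE, only geneC would remain.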


Details

The first step executed by cBprocess is to find the names of features which are deemed "reliable". A reliable feature is one with sufficient read coverage in every sample (i.e., > totcut_all reads in all samples), sufficient read coverage in all replicates of at least one experimental condition (i.e., > totcut reads in all replicates for one or more experimental conditions), and limited mutation content in all -s4U control samples (i.e., < high_p mutation rate in all samples lacking s4U feeds). In addition, if analyzing short-read sequencing data, two additional definitions of a reliable feature become pertinent: the maximum fraction of reads with 2 or fewer Us in each sample (Ucut) and the minimum average number of Us for a feature's reads in each sample (AvgU). This filtering is done with a call to reliableFeatures.
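The read-coverage and mutation-rate filters described above can be sketched on a toy cB-style data frame. The column names (sample, XF, TC, nT, n) follow bakR's cB conventions, but the code below is an illustrative base-R sketch, not the actual reliableFeatures implementation:

```r
# Toy cB-style data frame: one row per set of identical reads
cB <- data.frame(
  sample = rep(c("ctl_nos4U", "exp_s4U"), each = 2),
  XF     = rep(c("geneA", "geneB"), times = 2),
  TC     = c(0, 0, 3, 0),       # T-to-C mutations per read
  nT     = c(20, 25, 22, 24),   # number of Ts per read
  n      = c(60, 5, 70, 4)      # number of identical reads
)

high_p     <- 0.2   # mutation-rate cutoff for -s4U controls
totcut_all <- 10    # minimum reads required in every sample

# Per-feature, per-sample read totals and mutation rates
reads   <- tapply(cB$n, list(cB$XF, cB$sample), sum)
mutrate <- tapply(cB$TC * cB$n, list(cB$XF, cB$sample), sum) /
           tapply(cB$nT * cB$n, list(cB$XF, cB$sample), sum)

# Keep features with > totcut_all reads in all samples and
# < high_p mutation rate in the -s4U control sample
keep <- apply(reads > totcut_all, 1, all) &
        (mutrate[, "ctl_nos4U"] < high_p)
reliable <- rownames(reads)[keep]
```

Here geneB fails the coverage filter (only 5 and 4 reads), so `reliable` contains only geneA.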

The second step is to extract only the reliable features from the cB dataframe in the bakRData object. During this process, a numerical ID is given to each reliable feature, with the ID corresponding to the feature's order when arranged using dplyr::arrange.
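The arrange-then-number ID scheme can be sketched in base R (order() mirrors dplyr::arrange on a single column; the column name fnum is an illustrative choice, not necessarily what cBprocess uses internally):

```r
# Hypothetical set of reliable features, in the order they were encountered
feats <- data.frame(XF = c("geneC", "geneA", "geneB"))

# Sort alphabetically, then assign sequential numerical IDs
feats <- feats[order(feats$XF), , drop = FALSE]  # base-R analog of dplyr::arrange(XF)
feats$fnum <- seq_len(nrow(feats))               # ID follows the sorted order
```

After sorting, geneA receives ID 1 and geneC receives ID 3, regardless of their original order.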

The third step is to prepare a dataframe where each row corresponds to a set of n identical reads (that is, reads from the same sample with the same number of mutations and Us). Part of this process involves assigning an arbitrary numerical ID to each replicate in each experimental condition; the numerical ID corresponds to the order in which the sample appears in metadf. The outcome of this step is multiple dataframes with variable information content. These include a dataframe with information about read counts in each sample, one which logs the U-contents of each feature, one which is compatible with fast_analysis and thus groups reads by their number of mutations as well as their number of Us, and one which is compatible with TL_stan (with StanFit == TRUE) and thus groups reads by only their number of mutations. At the end of this step, two other smaller data structures are created: an average count matrix (a count matrix where the ith row and jth column corresponds to the average number of reads mapping to feature i in experimental condition j, averaged over all replicates) and a sample lookup table that relates the numerical experimental and replicate IDs to the original sample names.
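The average count matrix described above can be sketched with base R. The toy table and column names (fnum for feature ID, mut for experimental condition, rep for replicate) are illustrative assumptions:

```r
# Toy read-count table: two features, two conditions, two replicates each
counts <- data.frame(
  fnum  = rep(1:2, each = 4),            # feature ID (matrix row)
  mut   = rep(rep(1:2, each = 2), 2),    # experimental condition (matrix column)
  rep   = rep(1:2, times = 4),           # replicate ID
  reads = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Average over replicates within each feature x condition cell:
# avg[i, j] = mean reads mapping to feature i in condition j
avg <- tapply(counts$reads, list(counts$fnum, counts$mut), mean)
```

For feature 1 in condition 1, the two replicates (10 and 20 reads) average to 15, which lands in the first row and first column of `avg`.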


Value

Returns a list of objects that can be passed to TL_stan and/or fast_analysis.


Examples

# Load cB
data(cB_small)

# Load metadf
data(metadf)

# Create bakRData
bakRData <- bakRData(cB_small, metadf)

# Preprocess data
data_for_bakR <- cBprocess(obj = bakRData)

[Package bakR version 1.0.0 Index]