Simulate_bakRData {bakR} | R Documentation |

`Simulate_bakRData`

simulates a `bakRData`

object. It's output also includes the simulated
values of all kinetic parameters of interest. Only the number of genes (`ngene`

) has to be set by the
user, but an extensive list of additional parameters can be adjusted.

```
Simulate_bakRData(
ngene,
num_conds = 2L,
nreps = 3L,
eff_sd = 0.75,
eff_mean = 0,
fn_mean = 0,
fn_sd = 1,
kslog_c = 0.8,
kslog_sd = 0.95,
tl = 60,
p_new = 0.05,
p_old = 0.001,
read_lengths = 200L,
p_do = 0,
noise_deg_a = -0.3,
noise_deg_b = -1.5,
noise_synth = 0.1,
sd_rep = 0.05,
low_L2FC_ks = -1,
high_L2FC_ks = 1,
num_kd_DE = c(0L, as.integer(rep(round(as.integer(ngene)/2), times =
as.integer(num_conds) - 1))),
num_ks_DE = rep(0L, times = as.integer(num_conds)),
scale_factor = 150,
sim_read_counts = TRUE,
a1 = 5,
a0 = 0.01,
nreads = 50L,
alpha = 25,
beta = 75,
STL = FALSE,
STL_len = 40,
lprob_U_sd = 0,
lp_sd = 0
)
```

`ngene` |
Number of genes to simulate data for |

`num_conds` |
Number of experimental conditions (including the reference condition) to simulate |

`nreps` |
Number of replicates to simulate |

`eff_sd` |
Effect size; more specifically, the standard deviation of the normal distribution from which non-zero changes in logit(fraction new) are pulled from. |

`eff_mean` |
Effect size mean; mean of normal distribution from which non-zero changes in logit(fraction new) are pulled from. Note, setting this to 0 does not mean that some of the significant effect sizes will be 0, as any exact integer is impossible to draw from a continuous random number generator. Setting this to 0 just means that there is symmetric stabilization and destabilization |

`fn_mean` |
Mean of fraction news of simulated transcripts in reference condition. The logit(fraction) of RNA from each transcript that is metabolically labeled (new) is drawn from a normal distribution with this mean |

`fn_sd` |
Standard deviation of fraction news of simulated transcripts in reference condition. The logit(fraction) of RNA from each transcript that is metabolically labeled (new) is drawn from a normal distribution with this sd |

`kslog_c` |
Synthesis rate constants will be drawn from a lognormal distribution with meanlog = |

`kslog_sd` |
Synthesis rate lognormal standard deviation; see kslog_c documentation for details |

`tl` |
metabolic label feed time |

`p_new` |
metabolic label (e.g., s4U) induced mutation rate. Can be a vector of length num_conds |

`p_old` |
background mutation rate |

`read_lengths` |
Total read length for each sequencing read (e.g., PE100 reads correspond to read_lengths = 200) |

`p_do` |
Rate at which metabolic label containing reads are lost due to dropout; must be between 0 and 1 |

`noise_deg_a` |
Slope of trend relating log10(standardized read counts) to log(replicate variability) |

`noise_deg_b` |
Intercept of trend relating log10(standardized read counts) to log(replicate variability) |

`noise_synth` |
Homoskedastic variability of L2FC(ksyn) |

`sd_rep` |
Variance of lognormal distribution from which replicate variability is drawn |

`low_L2FC_ks` |
Most negative L2FC(ksyn) that can be simulated |

`high_L2FC_ks` |
Most positive L2FC(ksyn) that can be simulated |

`num_kd_DE` |
Vector where each element represents the number of genes that show a significant change in stability relative to the reference. 1st entry must be 0 by definition (since relative to the reference the reference sample is unchanged) |

`num_ks_DE` |
Same as num_kd_DE but for significant changes in synthesis rates. |

`scale_factor` |
Factor relating RNA concentration (in arbitrary units) to average number of read counts |

`sim_read_counts` |
Logical; if TRUE, read counts are simulated as coming from a heterodisperse negative binomial distribution |

`a1` |
Heterodispersion 1/reads dependence parameter |

`a0` |
High read depth limit of negative binomial dispersion parameter |

`nreads` |
Number of reads simulated if sim_read_counts is FALSE |

`alpha` |
shape1 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are drawn for each gene. |

`beta` |
shape2 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are drawn for each gene. |

`STL` |
logical; if TRUE, simulation is of STL-seq rather than a standard TL-seq experiment. The two big changes are that a short read length is required (< 60 nt) and that every read for a particular feature will have the same number of Us. Only one read length is simulated for simplicity. |

`STL_len` |
Average length of simulated STL-seq length. Since Pol II typically pauses about 20-60 bases from the promoter, this should be around 40 |

`lprob_U_sd` |
Standard deviation of the logit(probability nt is a U) for each sequencing read. The number of Us in a sequencing read are drawn from a binomial distribution with prob drawn from a logit-Normal distribution with this logit-sd. |

`lp_sd` |
Standard deviation of logit(probability a U is mutated) for each U. The number of mutations in a given read is the sum of nU Bernoulli random variables, where nU is the number of Us, and p is drawn from a logit-normal distribution with lp_sd standard deviation on logit scale. |

`Simulate_bakRData`

simulates a `bakRData`

object using a realistic generative model with many
adjustable parameters. Average RNA kinetic parameters are drawn from biologically inspired
distributions. Replicate variability is simulated by drawing a feature's
fraction new in a given replicate from a logit-Normal distribution with a heteroskedastic
variance term with average magnitude given by the chosen read count vs. variance relationship.
For each replicate, a feature's ksyn is drawn from a homoskedastic lognormal distribution. Read counts
can either be set to the same value for all simulated features or can be simulated according to
a heterodisperse negative binomial distribution. The latter is the default

The number of Us in each sequencing read is drawn from a binomial distribution with number of trials
equal to the read length and probability of each nucleotide being a U drawn from a beta distribution. Each read is assigned to the
new or old population according to a Bernoulli distribution with p = fraction new. The number of
mutations in each read are then drawn from one of two binomial distributions; if the read is assigned to the
population of new RNA, the number of mutations are drawn from a binomial distribution with number of trials equal
to the number of Us and probability of mutation = `p_new`

; if the read is assigned to the population of old RNA,
the number of mutations is instead drawn from a binomial distribution with the same number of trials but with the probability
of mutation = `p_old`

. `p_new`

must be greater than `p_old`

because mutations in new RNA
arise from both background mutations that occur with probability `p_old`

as well as metabolic label induced mutations

Simulated read counts should be treated as if they are spike-in and RPKM normalized, so the same scale factor of 1 can be applied to each sample when comparing the sequencing reads (e.g., if you are performing differential expression analysis).

Function to simulate a `bakRData`

object according to a realistic generative model

A list containing a simulated `bakRData`

object as well as a list of simulated kinetic parameters of interest.
The contents of the latter list are:

Effect_sim; Dataframe meant to mimic formatting of Effect_df that are part of

`bakRFit(StanFit = TRUE)`

,`bakRFit(HybridFit = TRUE)`

and`bakRFit(bakRData object)`

output.Fn_mean_sim; Dataframe meant to mimic formatting of Regularized_ests that is part of

`bakRFit(bakRData object)`

output. Contains information about the true fraction new simulated in each condition (the mean of the normal distribution from which replicate fraction news are simulated)Fn_rep_sim; Dataframe meant to mimic formatting of Fn_Estimates that is part of \codebakRFit(bakRData object) output. Contains information about the fraction new simulated for each feature in each replicate of each condition.

L2FC_ks_mean; The true L2FC(ksyn) for each feature in each experimental condition. The i-th column corresponds to the L2FC(ksyn) when comparing the i-th condition to the reference condition (defined as the 1st condition) so the 1st column is always all 0s

RNA_conc; The average number of normalized read counts expected for each feature in each sample.

```
# 2 replicate, 2 experimental condition, 1000 gene simulation
sim_2reps <- Simulate_bakRData(ngene = 1000, nreps = 2)
# 3 replicate, 2 experimental condition, 1000 gene simulation
# with 100 instances of differential degradation kinetics
sim_3reps <- Simulate_bakRData(ngene = 1000, num_kd_DE = c(0, 100))
# 2 replicates, 3 experimental condition, 1000 gene simulation
# with 100 instances of differential degradation kinetics in the 1st
# condition and no instances of differential degradation kinetics in the
# 2nd condition
sim_3es <- Simulate_bakRData(ngene = 1000,
nreps = 2,
num_conds = 3,
num_kd_DE = c(0, 100, 0))
```

[Package *bakR* version 1.0.0 Index]