simulateigaseq {IgAScores} | R Documentation |
Simulate an IgA-Seq dataset from a pre-defined set of IgA-binding probabilities
Description
Simulates IgA-Seq data sets with a defined binding distribution, which can be used to test the performance of scoring methods.
Usage
simulateigaseq(
igavalmeans = NULL,
igavalsds = NULL,
nosamples = 10,
samplingdepth = 1e+05,
posthresh = 4,
negthresh = 2,
seed = 66,
betweengroups = FALSE,
betweenper = 10,
betweensp = NULL
)
Arguments
igavalmeans |
A vector of mean IgA values for as many species as you wish to simulate. Will default to an exponentially distributed vector of 10 species. |
igavalsds |
A vector of standard deviations that will be used to generate IgA value distributions alongside the means. Defaults to 1 for all values. |
nosamples |
The number of samples for which to generate simulated data. Defaults to 10. |
samplingdepth |
The number of bacteria to simulate in each sample. Defaults to 100000. |
posthresh |
The IgA value threshold above which a bacterium will be considered IgA positive. Defaults to 4 (which is reasonable with the other defaults). It is recommended to run the simulation twice, using the first run to determine reasonable thresholds. |
negthresh |
The IgA value threshold below which a bacterium will be considered IgA negative. Defaults to 2 (which is reasonable with the other defaults). It is recommended to run the simulation twice, using the first run to determine reasonable thresholds. |
seed |
Seed for random number generation. Defaults to 66, so it must be changed to obtain different results when rerunning simulations. |
betweengroups |
If TRUE, the starting abundances of half of the samples are modified in the same way (by adding betweenper% of total counts to a single species) to simulate an abundance shift without a change in IgA binding affinity. Defaults to FALSE. |
betweenper |
Percentage of total counts to add to a species in the second group in the betweengroups mode. Defaults to 10. |
betweensp |
Species (by index) to increase in the between-groups simulation. Chosen at random if NULL (default). |
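The abundance shift controlled by betweengroups, betweenper, and betweensp can be pictured with a minimal base R sketch. This is not the package's implementation: the object names (counts, shifted, group2), the Poisson starting counts, and the choice of the second half of the samples as the shifted group are all assumptions made for illustration.

```r
set.seed(66)

nspecies   <- 4
nosamples  <- 6
betweenper <- 10   # percentage of total counts to add
betweensp  <- 2    # index of the species to increase

# Hypothetical starting counts: species in rows, samples in columns
counts <- matrix(
  rpois(nspecies * nosamples, lambda = 250),
  nrow = nspecies,
  dimnames = list(paste0("Species", seq_len(nspecies)),
                  paste0("Sample", seq_len(nosamples)))
)

# Assumption: the second half of the samples forms the shifted group
group2 <- seq(nosamples / 2 + 1, nosamples)

# Add betweenper% of each sample's total counts to the chosen species
extra <- as.integer(round(colSums(counts)[group2] * betweenper / 100))
shifted <- counts
shifted[betweensp, group2] <- counts[betweensp, group2] + extra
```

Only one species changes, and only in the second group, so any difference in its counts between groups reflects abundance rather than IgA binding.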
Details
This function generates a simulated immunoglobulin A sequencing (IgA-Seq) data set from the means (and standard deviations) of the IgA binding values expected for each species, together with cut-offs defining the IgA positive and negative gates. The input is a vector giving the average IgA value of each species. These can be arbitrary values representing the relative level of IgA binding between the species, provided the standard deviations and cut-offs are of the same magnitude. The values are treated as the means of a normal distribution of IgA binding values for each species. Species counts are generated on a log distribution for a given number of samples at an even depth. For each bacterium in each sample, an IgA binding value is then assigned by sampling from its species' IgA value distribution. The thresholds defining the positive and negative gates are then used to generate positive and negative counts tables of the bacteria whose values fall into these gates.
A second mode (enabled by setting betweengroups to TRUE) introduces a consistent abundance change in half of the samples by increasing one species in them. This can be used to simulate case-control experiments where, for example, one taxon has bloomed. Further details can be found in Jackson et al. (2020, doi: 10.1101/2020.08.19.257501).
Note: IgA values are simulated for each bacterium in each sample, so setting the combination of samplingdepth, number of species, and number of samples too high will slow data generation.
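The procedure described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used by simulateigaseq: the helper names (simulate_sample, count_table) are hypothetical, and log-normal weights are used as a stand-in for the "log distribution" of species counts, which the package may generate differently.

```r
set.seed(66)

igavalmeans   <- c(0.1, 1, 10, 15)  # mean IgA value per species
igavalsds     <- rep(1, 4)          # standard deviation per species
nosamples     <- 3
samplingdepth <- 100
posthresh     <- 8
negthresh     <- 4
nspecies      <- length(igavalmeans)

# Assumption: log-normal weights stand in for the "log distribution"
# of species abundances described in the text.
simulate_sample <- function() {
  weights <- rlnorm(nspecies)
  # Species identity of each simulated bacterium, at an even depth
  species <- sample(nspecies, samplingdepth, replace = TRUE, prob = weights)
  # One IgA value per bacterium, drawn from its species' normal distribution
  igaval <- rnorm(samplingdepth,
                  mean = igavalmeans[species],
                  sd   = igavalsds[species])
  data.frame(species = factor(species, levels = seq_len(nspecies)),
             igaval  = igaval)
}

sims <- lapply(seq_len(nosamples), function(i) simulate_sample())

# Per-species counts for the bacteria selected by `keep`;
# factor levels ensure absent species are reported as zero.
count_table <- function(keep) {
  sapply(sims, function(s) as.vector(table(s$species[keep(s$igaval)])))
}

presortcounts <- count_table(function(v) rep(TRUE, length(v)))
poscounts     <- count_table(function(v) v > posthresh)
negcounts     <- count_table(function(v) v < negthresh)
```

Each resulting table has one row per species and one column per sample. Bacteria with values between negthresh and posthresh fall into neither gate, so the positive and negative counts need not sum to the pre-sort counts.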
Value
A list containing the simulated data set and relevant input parameters.
presortcounts - A data frame containing simulated species counts for each sample in the pre-sort sample.
presortabunds - presortcounts as relative abundances.
poscounts - A data frame containing simulated species counts for each sample in the IgA positive fraction.
posabunds - poscounts as relative abundances.
negcounts - A data frame containing simulated species counts for each sample in the IgA negative fraction.
negabunds - negcounts as relative abundances.
possizes - A vector of the IgA positive fraction sizes for each sample.
negsizes - A vector of the IgA negative fraction sizes for each sample.
igabinding - A long format data frame containing the simulated IgA binding values for all simulated bacteria used to generate the count tables.
igavalmeans - A vector of the mean IgA values for each species used in the simulation.
igavalsds - A vector of the standard deviations of the IgA values for each species used in the simulation.
posthresh - Numeric, the lower threshold used to determine that a bacterium is IgA positive in the simulation.
negthresh - Numeric, the upper threshold used to determine that a bacterium is IgA negative in the simulation.
expgroup - A vector showing class labels for the experimental group of each sample in the experiment. Will be uniform unless doing between group simulations.
expspecies - Numeric, showing which species was modelled as differentially abundant between experimental groups when carrying out between-group simulations.
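The relationship between the counts tables and their relative-abundance counterparts can be sketched on made-up data. The sketch uses plain matrices rather than data frames, and it assumes that species are in rows and samples in columns and that a fraction size is the share of pre-sort bacteria falling into the gate. Neither assumption is stated above, and to_abunds is a hypothetical helper.

```r
# Hypothetical counts: species in rows, samples in columns
presortcounts <- matrix(c(50, 30, 20,
                          10, 60, 30),
                        nrow = 3,
                        dimnames = list(paste0("Species", 1:3),
                                        c("Sample1", "Sample2")))
poscounts <- matrix(c(5, 20, 15,
                      0, 30, 10),
                    nrow = 3,
                    dimnames = dimnames(presortcounts))

# Divide each column by its total so that every sample sums to 1
to_abunds <- function(counts) sweep(counts, 2, colSums(counts), "/")

presortabunds <- to_abunds(presortcounts)
posabunds     <- to_abunds(poscounts)

# Assumed definition of a fraction size: gated bacteria / pre-sort bacteria
possizes <- colSums(poscounts) / colSums(presortcounts)
```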
Examples
# Simulate four species at a shallow sampling depth so the example runs quickly
dat <- simulateigaseq(c(0.1, 1, 10, 15), rep(1, 4), posthresh = 8, negthresh = 4, samplingdepth = 100)