simulate_data {eCV}    R Documentation
Simulates omic features into reproducible and irreproducible groups
Description
This function extends the copula mixture model simulations presented in Li et al. (2011). It generates n_features omic features for each of n_reps (>= 2) replicates. The state of each feature (reproducible or irreproducible) is determined by sampling a binomial variable K with a vector of probabilities, P. P holds the mixing probabilities between two multivariate normal distributions, and each element of P is associated with one reproducibility group. For example, if K takes only the values 0 and 1, then K labels each feature as belonging to the reproducible or the irreproducible group.
Usage
simulate_data(n_reps = 2, n_features = 10000, scenario = 1)
Arguments
n_reps
Number of sample replicates. Numeric. Must be at least 2. Defaults to 2.
n_features
Number of omic features to simulate. Numeric. Defaults to 1e4.
scenario
Combination of parameter values defining the simulation scenarios in Li et al. (2011). Numeric. Possible values are 1, 2, 3, or 4. Defaults to 1.
Details
The dimension of each normal distribution equals the number of replicates, n_reps. The "scenario" argument sets the parameter values according to the simulation scenarios outlined in Li et al. (2011) (Table I in the article). Scenario 1 corresponds to a situation where reproducible features are highly correlated and outnumber the irreproducible features. Scenario 2 corresponds to a situation where reproducible features are fewer than the irreproducible ones and exhibit low correlation. Scenario 3 represents a situation where reproducible features are fewer than the irreproducible ones but are still highly correlated. Scenario 4 is a generalization of Scenario 1 that adds a component of “reproducible noise” consisting of highly correlated but low-intensity features.
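The sampling idea described above can be sketched in base R. This is only an illustration of a two-component normal mixture, not the package's implementation: the function name sketch_mixture, its arguments, and all parameter values are hypothetical and are not the values from Table I of Li et al. (2011). It also omits any copula transformation of the simulated values.

```r
# Hypothetical helper: NOT part of eCV. Parameter defaults are made-up
# illustrative values, not those of Li et al. (2011).
sketch_mixture <- function(n_features = 1000, n_reps = 2,
                           p     = c(0.35, 0.65),  # P(irreproducible), P(reproducible)
                           mu    = c(0, 2.5),      # mean of each component
                           sigma = c(1, 1),        # standard deviation of each component
                           rho   = c(0, 0.8)) {    # between-replicate correlation
  stopifnot(n_reps >= 2, length(p) == 2, abs(sum(p) - 1) < 1e-8,
            all(rho >= 0), all(rho <= 1))

  # Group indicator K: 1 = irreproducible, 2 = reproducible.
  k <- sample(1:2, size = n_features, replace = TRUE, prob = p)

  # Equicorrelated standard normals: one factor shared by all replicates
  # plus independent replicate-specific noise gives pairwise correlation rho.
  shared <- rnorm(n_features)
  noise  <- matrix(rnorm(n_features * n_reps), nrow = n_features)
  z <- sqrt(rho[k]) * shared + sqrt(1 - rho[k]) * noise

  # Shift and scale each feature (row) according to its group.
  x <- mu[k] + sigma[k] * z
  colnames(x) <- paste("Rep", seq_len(n_reps))
  list(data = x, group = k)
}

set.seed(1)
sim <- sketch_mixture(n_features = 5000, n_reps = 3)
table(sim$group)                   # number of features drawn in each group
cor(sim$data[sim$group == 2, ])    # correlations within the reproducible group
```

The matrix has one column per replicate, which mirrors the statement that the dimension of each normal distribution is set by the number of replicates.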
Value
Returns a list of two elements:
- sim_data: Matrix of dimensions n_features x n_reps with the simulated numerical values for each feature.
- sim_params: List with all the parameter values.
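As a quick check of the returned structure, the following sketch runs each scenario with a reduced number of features and inspects both list elements. It assumes that eCV is installed and that sim_params contains a feature_group element, as used in the Examples section; the variable names dims and s are only for illustration.

```r
# Collect the dimensions of sim_data for each scenario (empty if eCV is absent).
dims <- list()

if (requireNamespace("eCV", quietly = TRUE)) {
  set.seed(42)
  for (s in 1:4) {
    out <- eCV::simulate_data(n_reps = 2, n_features = 1000, scenario = s)

    # sim_data is documented as an n_features x n_reps matrix.
    dims[[paste0("scenario_", s)]] <- dim(out$sim_data)

    # Number of simulated features in each group for this scenario.
    cat("Scenario", s, "\n")
    print(table(out$sim_params$feature_group))
  }
}

dims
```

Using a smaller n_features than the default keeps the check fast; the group proportions should still reflect the scenario chosen.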
References
Q. Li, J. B. Brown, H. Huang, and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
Examples
library(eCV)
library(tidyverse)

# Simulate the default 10,000 features for two replicates under scenario 1.
set.seed(42)
out <- simulate_data(scenario = 1)

# Plot replicate 1 against replicate 2, colored by simulated feature group.
out$sim_data %>%
  as.data.frame() %>%
  mutate(`Features group` = as.character(out$sim_params$feature_group)) %>%
  ggplot(aes(x = `Rep 1`, y = `Rep 2`, color = `Features group`)) +
  geom_point(size = 1, alpha = 0.5) +
  scale_color_manual(values = c("#009CA6", "#F4364C")) +
  theme_classic()