SimData {SimSeq} | R Documentation |
SimData
Description
Given a matrix of RNA-seq data with large samples sizes in at least two treatment groups, SimData simulates a new matrix of RNA-seq data with a known list of differentially expressed (DE) and equivalently expressed (EE) genes from the original data.
Usage
SimData(counts, treatment, replic = NULL, sort.method, k.ind,
n.genes = NULL, n.diff = NULL, norm.factors = NULL,
samp.independent = FALSE, genes.select = NULL, genes.diff = NULL, switch.trt = FALSE,
probs = NULL, weights = NULL, exact = FALSE, power = 1)
Arguments
counts |
A matrix of counts where each row specifies a gene and each column specifies a replicate. |
treatment |
A vector specifying the treatment group for each column of the counts matrix. Only two treatment groups of either paired or unpaired data are allowed. |
replic |
A vector specifying the replicate for each column of the counts matrix when there is paired data; optional if data is unpaired. |
sort.method |
One of either "paired" or "unpaired", depending on the structure of the counts matrix. |
k.ind |
The number of experimental units to be simulated for each treatment group. |
n.genes |
The number of genes to be subsetted from the counts matrix in the simulation. Must be less than the total number of rows in the counts matrix and greater than n.diff. Optional if genes.select vector is specified. |
n.diff |
The number of genes simulated to be differentially expressed. Must be less than n.genes. Optional if genes.diff vector is specified. |
norm.factors |
A positive numeric vector of multiplicative normalization factors for each column of the counts matrix. Will default to CalcNormFactors function from the package edgeR with method = "TMM". |
samp.independent |
Should columns be resampled for each gene. (Defaults to FALSE - setting to TRUE is not recommended.) |
genes.select |
A vector specifying genes to be subsetted from the counts matrix in the simulation. Can be either a logical vector or a numeric vector indicating the rows of the counts matrix to be used. Optional if n.genes is specified. |
genes.diff |
A vector specifying genes to be differentially expressed in the simulation. Genes selected to be differentially expressed must be a subset of the genes selected in the genes.select vector. Can be either logical vector with length equal to genes selected or a numeric vector indicating the rows of the counts matrix to be used. Optional if n.diff is specified. |
switch.trt |
Logical specifying which treatment group should be sampled from for EE genes. Default is first treatment group. |
probs |
Optional vector specifying the p-value of differential expression for each gene to be used in the estimate of empirical Bayes probability for each gene. If not provided and weights are not specified, SimData will perform either a signed rank test (paired case) or a rank sum test (unpaired) case for each gene in the counts matrix. |
weights |
Optional vector specifying weights to be used for sampling which genes are to be differentially expressed in the simulation. If null, weights will be calculated using the fdrtool function from the package 'fdrtool' to calculate one minus local fdr. If desired, the sampling of differentially expressed genes can be done without respect to any weights by providing a vector of ones. |
exact |
Specifies whether an exact signed rank test (paired) or exact ranksum test (unpaired) should be used. |
power |
Transorms the weights for each gene by raising each weights to some power. Must be greater than 0. Default is set to 1. |
Details
SimData simulates an RNA-seq matrix of counts from a large source RNA-sequence dataset using a resampling based approach. If you use the Kidney Cancer (KIRC) dataset from the The Cancer Genome Atlas as the source RNA-seq counts matrix in the SimData function, please cite The Cancer Genome Atlas Research Network (2013) in any publications.
Value
List containing:
counts |
matrix of simulated RNA-seq data with known list of DE and EE genes. |
treament |
a numeric vector specifying the treatment group structure in the new simulated counts matrix. |
genes.subset |
all genes included in the simulated matrix. The i-th entry corresponds to the i-th row of the simulated counts matrix. |
DE.genes |
vector of genes used for differential expression in the simulated counts matrix. |
DE.ind |
logical vector indicating which genes are differentially expressed in the simulated counts matrix. |
col |
vector of lenght 3*k.ind specifying which columns were used in the simulation. The first k.ind columns were used to simulate the first treatment group. The last 2*k.ind columns were used to simulate the second treatment group. If switch.trt equals TRUE, then the first 2*k.ind columns were used to simulate the first treatment group and the last k.ind columns were used to simulate the second treatment group. |
Author(s)
Samuel Benidt <sgbenidt@gmail.com>
Source
Benidt, S., and Nettleton, D. (2015). SimSeq: a nonparametric approach to simulation of RNA-sequence datasets. Bioinformatics. 31, 2131-2140.
The Cancer Genome Atlas Research Network (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature, 499(7456), 43-49.
https://tcga-data.nci.nih.gov/tcga/
Examples
data(kidney)
counts <- kidney$counts # Matrix of read counts from KIRC dataset
replic <- kidney$replic # Replic vector indicating paired columns
treatment <- kidney$treatment # Treatment vector indicating Non-Tumor or Tumor columns
nf <- apply(counts, 2, quantile, 0.75)
library(fdrtool)
## Not run:
### Example 1: Simulate Matrix with 1000 DE genes and 4000 EE genes
data.sim <- SimData(counts = counts, replic = replic, treatment = treatment,
sort.method = "paired", k.ind = 5, n.genes = 5000, n.diff = 1000,
norm.factors = nf)
### Example 2: Calculate weights vector beforehand to save run time in
### repeated simulations
sort.list <- SortData(counts = counts, treatment = treatment, replic = replic,
sort.method = "paired", norm.factors = nf)
counts <- sort.list$counts
replic <- sort.list$replic
treatment <- sort.list$treatment
nf <- sort.list$norm.factors
probs <- CalcPvalWilcox(counts, treatment, sort.method = "paired",
sorted = TRUE, norm.factors = nf, exact = FALSE)
weights <- 1 - fdrtool(probs, statistic = "pvalue", plot = FALSE, verbose = FALSE)$lfdr
data.sim <- SimData(counts = counts, replic = replic, treatment = treatment,
sort.method = "paired", k.ind = 5, n.genes = 5000, n.diff = 1000,
weights = weights, norm.factors = nf)
### Example 3: Specify which genes you want to use in the simulation
# Randomly sample genes or feed in the exact genes you wish to use
genes.diff <- sample(1:nrow(counts), size = 1000, prob = weights)
genes <- c(sample(1:nrow(counts)[-genes.diff], 4000), genes.diff)
data.sim <- SimData(counts = counts, replic = replic, treatment = treatment,
sort.method = "paired", k.ind = 5, genes.select = genes,
genes.diff = genes.diff, weights = weights, norm.factors = nf)
### Example 4: Simulate matrix with DE genes having log base 2 fold change greater than 1
# add one to counts matrix to avoid infinities when taking logs
tumor.mean <- rowMeans(log2((counts[, treatment == "Tumor"] + 1) %*%
diag(1/nf[treatment == "Tumor"])))
nontumor.mean <- rowMeans(log2((counts[, treatment == "Non-Tumor"] + 1) %*%
diag(1/nf[treatment == "Non-Tumor"])))
lfc <- tumor.mean - nontumor.mean
weights.zero <- abs(lfc) < 1
weights[weights.zero] <- 0
data.sim <- SimData(counts = counts, replic = replic, treatment = treatment,
sort.method = "paired", k.ind = 5, n.genes = 5000, n.diff = 1000,
weights = weights, norm.factors = nf)
### Example 5: Simulate three treatment groups:
### 3 Different types of Differential Expression Allowed
### First Group Diff, Second and Third group Equal
### Second Group Diff, First and Third group Equal
### Third Group Diff, First and Second group Equal
k <- 5 # Sample Size in Each treatment group
### Sample DE genes beforehand
N <- nrow(counts)
genes.de <- sample(1:N, size = 1000, prob = weights) # Sample all DE genes
DE1 <- genes.de[1:333] # Sample DE genes with first trt diff
DE2 <- genes.de[334:666] # Sample DE genes with sec trt diff
DE3 <- genes.de[667:1000] # Sample DE genes with third trt diff
EE <- sample( (1:N)[-genes.de], size = 4000) #Sample EE genes
genes.tot <- c(EE, genes.de)
genes.de1 <- union(DE2, EE) #Assign DE genes for first sim
genes.de2 <- union(DE2, DE3) #Assign DE genes for second sim
data.sim1 <- SimData(counts = counts, replic = replic, treatment = treatment,
sort.method = "paired", k.ind = k, genes.select = genes.tot,
genes.diff = genes.de1, weights = weights, norm.factors = nf)
#remove pairs of columns used in first simulation
cols.rm <- c(data.sim1$col[1:(2*k)], data.sim1$col[1:(2*k)] + 1)
counts.new <- counts[, -cols.rm]
nf.new <- nf[-cols.rm]
replic.new <- replic[-cols.rm]
treatment.new <- treatment[-cols.rm]
### Set switch.trt = TRUE for second sim
data.sim2 <- SimData(counts = counts.new, replic = replic.new, treatment = treatment.new,
sort.method = "paired", k.ind = k, genes.select = genes.tot,
genes.diff = genes.de2, weights = weights, norm.factors = nf.new,
switch.trt = TRUE)
### Remove first k.ind entries from first sim and combine two count matrices
counts.sim <- cbind(data.sim1$counts[, -(1:k)], data.sim2$counts)
### treatment group levels for simulated matrix
trt.grp <- rep(NA, 5000)
trt.grp[is.element(data.sim1$genes.subset, DE1)] <- "DE_First_Trt"
trt.grp[is.element(data.sim1$genes.subset, DE2)] <- "DE_Second_Trt"
trt.grp[is.element(data.sim1$genes.subset, DE3)] <- "DE_Third_Trt"
trt.grp[is.element(data.sim1$genes.subset, EE)] <- "EE"
## End(Not run)