| thin_2group {seqgendiff} | R Documentation | 
Binomial thinning in the two-group model.
Description
Given a matrix of real RNA-seq counts, this function will
randomly assign samples to one of two groups, draw
the log2-fold change in expression between two groups for each gene,
and add this signal to the RNA-seq counts matrix. The user may specify
the proportion of samples in each group, the proportion of null genes
(where the log2-fold change is 0),
and the signal function. This is a specific application of the
general binomial thinning approach implemented in thin_diff.
Usage
thin_2group(
  mat,
  prop_null = 1,
  signal_fun = stats::rnorm,
  signal_params = list(mean = 0, sd = 1),
  group_prop = 0.5,
  corvec = NULL,
  alpha = 0,
  permute_method = c("hungarian", "marriage"),
  type = c("thin", "mult")
)
Arguments
| mat | A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples. | 
| prop_null | The proportion of genes that are null. | 
| signal_fun | A function that returns the signal. This should take as
input  | 
| signal_params | A list of additional arguments to pass to
 | 
| group_prop | The proportion of individuals that are in group 1. | 
| corvec | A vector of target correlations.  | 
| alpha | The scaling factor for the signal distribution from
Stephens (2016). If  | 
| permute_method | Should we use the Gale-Shapley algorithm
for stable marriages ( | 
| type | Should we apply binomial thinning ( | 
Details
The specific application of binomial thinning to the two-group model was used in Gerard and Stephens (2018) and Gerard and Stephens (2021). This is a specific case of the general method described in Gerard (2020).
Value
A list-like S3 object of class ThinData.
Components include some or all of the following:
- mat
- The modified matrix of counts. 
- designmat
- The design matrix of variables used to simulate signal. This is made by column-binding - design_fixedand the permuted version of- design_perm.
- coefmat
- A matrix of coefficients corresponding to - designmat.
- design_obs
- Additional variables that should be included in your design matrix in downstream fittings. This is made by column-binding the vector of 1's with - design_obs.
- sv
- A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables. 
- cormat
- A matrix of target correlations between the surrogate variables and the permuted variables in the design matrix. This might be different from the - target_coryou input because we pass it through- fix_corto ensure positive semi-definiteness of the resulting covariance matrix.
- matching_var
- A matrix of simulated variables used to permute - design_permif the- target_coris not- NULL.
Author(s)
David Gerard
References
- Gale, David, and Lloyd S. Shapley. "College admissions and the stability of marriage." The American Mathematical Monthly 69, no. 1 (1962): 9-15. doi:10.1080/00029890.1962.11989827. 
- Gerard, D., and Stephens, M. (2021). "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls." Statistica Sinica, 31(3), 1145-1166 doi:10.5705/ss.202018.0345. 
- David Gerard and Matthew Stephens (2018). "Empirical Bayes shrinkage and false discovery rate estimation, allowing for unwanted variation." Biostatistics, doi:10.1093/biostatistics/kxy029. 
- Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9. 
- Hornik K (2005). "A CLUE for CLUster Ensembles." Journal of Statistical Software, 14(12). doi:10.18637/jss.v014.i12. doi:10.18637/jss.v014.i12. 
- C. Papadimitriou and K. Steiglitz (1982), Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall. 
- Stephens, Matthew. "False discovery rates: a new deal." Biostatistics 18, no. 2 (2016): 275-294. doi:10.1093/biostatistics/kxw041. 
- Wakefield, Jon. "Bayes factors for genome-wide association studies: comparison with P-values." Genetic epidemiology 33, no. 1 (2009): 79-86. doi:10.1002/gepi.20359. 
See Also
- select_counts
- For subsampling the rows and columns of your real RNA-seq count matrix prior to applying binomial thinning. 
- thin_diff
- For the more general thinning approach. 
- ThinDataToSummarizedExperiment
- For converting a ThinData object to a SummarizedExperiment object. 
- ThinDataToDESeqDataSet
- For converting a ThinData object to a DESeqDataSet object. 
Examples
## Simulate data from given matrix of counts
## In practice, you would obtain Y from a real dataset, not simulate it.
set.seed(1)
nsamp <- 10
ngene <- 1000
Y <- matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene)
thinout <- thin_2group(mat           = Y,
                       prop_null     = 0.9,
                       signal_fun    = stats::rexp,
                       signal_params = list(rate = 0.5))
## 90 percent of genes are null
mean(abs(thinout$coef) < 10^-6)
## Check the estimates of the log2-fold change
Ynew <- log2(t(thinout$mat + 0.5))
X    <- thinout$designmat
Bhat <- coef(lm(Ynew ~ X))["X", ]
plot(thinout$coefmat,
     Bhat,
     xlab = "log2-fold change",
     ylab = "Estimated log2-fold change")
abline(0, 1, col = 2, lwd = 2)