select_counts {seqgendiff} | R Documentation |
Subsample the rows and columns of a count matrix.
Description
It is a good idea to subsample the genes and samples from a real RNA-seq
dataset, anew at each iteration, prior to applying thin_diff
(and related functions), so that your conclusions do not depend on the
specific structure of your dataset. This function is designed to do this
efficiently.
Usage
select_counts(
mat,
nsamp = ncol(mat),
ngene = nrow(mat),
gselect = c("random", "max", "mean_max", "custom"),
gvec = NULL,
filter_first = FALSE,
nskip = 0L
)
Arguments
mat |
A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples. |
nsamp |
The number of samples (columns) to select from |
ngene |
The number of genes (rows) to select from |
gselect |
How should we select the subset of genes? Options include:
|
gvec |
A logical vector of length |
filter_first |
Should we first filter genes by the method of
Chen et al. (2016) ( |
nskip |
The number of median-maximally expressed genes to skip.
Not used if |
Details
The samples (columns) are chosen randomly, with each sample having
an equal probability of being in the sub-matrix. The genes are selected
according to one of four schemes (see the description of the gselect
argument).
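The uniform sampling described above can be sketched in base R. This is only an illustration, not the seqgendiff implementation: the index vectors samp_idx and gene_idx are hypothetical names, and it assumes that gselect = "random" means genes are also drawn uniformly at random, which this page does not spell out.

```r
## Illustrative stand-in for uniform subsampling; not the package's own code.
set.seed(1)
mat <- matrix(stats::rpois(1000 * 100, lambda = 50), nrow = 1000)
nsamp <- 10
ngene <- 100

## Every sample (column) has the same chance of being kept.
samp_idx <- sort(sample.int(ncol(mat), size = nsamp))

## Assumed reading of gselect = "random": genes (rows) drawn uniformly too.
gene_idx <- sort(sample.int(nrow(mat), size = ngene))

## drop = FALSE keeps the result a matrix even when a dimension has length 1.
submat <- mat[gene_idx, samp_idx, drop = FALSE]

## Record the original indices as dimnames (stored as character strings).
dimnames(submat) <- list(gene_idx, samp_idx)
dim(submat)
```

Storing the original indices in the dimnames makes it possible to map any row or column of the sub-matrix back to mat with as.integer(rownames(submat)) and as.integer(colnames(submat)).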
If you have edgeR installed, then some functionality is provided for
filtering out the lowest expressed genes prior to applying subsampling
(see the filter_first argument).
This filtering scheme is described in Chen et al. (2016).
If you want more control over this filtering, you should use
the filterByExpr function from edgeR directly. You
can install edgeR by following instructions at
doi:10.18129/B9.bioc.edgeR.
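As a rough sketch of filtering with edgeR directly before subsampling, the code below calls filterByExpr on a count matrix and then passes the filtered matrix to select_counts. The two-group labels in group are hypothetical and serve only to give filterByExpr a group size to work with; the min.count value shown is the edgeR default and is something you might tune. Both package calls are guarded so the code is skipped when a package is not installed.

```r
## Sketch only: requires the Bioconductor package edgeR.
if (requireNamespace("edgeR", quietly = TRUE)) {
  set.seed(1)
  mat <- matrix(stats::rpois(1000 * 100, lambda = 50), nrow = 1000)

  ## Hypothetical group labels, 50 samples per group.
  group <- rep(c("A", "B"), each = 50)

  ## Logical vector with one entry per gene: TRUE means keep the gene.
  keep <- edgeR::filterByExpr(mat, group = group, min.count = 10)
  filtered <- mat[keep, , drop = FALSE]

  ## Subsample the already-filtered matrix, leaving filter_first = FALSE.
  if (requireNamespace("seqgendiff", quietly = TRUE)) {
    submat <- seqgendiff::select_counts(
      mat = filtered,
      nsamp = 10,
      ngene = min(100, nrow(filtered))
    )
  }
}
```

Filtering up front like this keeps the filtering thresholds under your control, while select_counts is then responsible only for the subsampling.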
Value
A numeric matrix, which is an ngene by nsamp sub-matrix of mat.
If rownames(mat) is NULL, then the row names of the returned matrix
are the indices in mat of the selected genes. If colnames(mat) is NULL,
then the column names of the returned matrix are the indices in mat of
the selected samples.
Author(s)
David Gerard
References
Chen, Yunshun, Aaron TL Lun, and Gordon K. Smyth. "From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline." F1000Research 5 (2016). doi:10.12688/f1000research.8987.2.
Examples
## Simulate data from given matrix of counts
## In practice, you would obtain mat from a real dataset, not simulate it.
set.seed(1)
n <- 100
p <- 1000
mat <- matrix(stats::rpois(n * p, lambda = 50), nrow = p)
## Subsample the matrix, then feed it into a thinning function
submat <- select_counts(mat = mat, nsamp = 10, ngene = 100)
thout <- thin_2group(mat = submat, prop_null = 0.5)
## The rownames and colnames (if NULL in mat) tell you which genes/samples
## were selected.
rownames(submat)
colnames(submat)