R: Subsample the rows and columns of a count matrix.

select_counts {seqgendiff}

R Documentation

Subsample the rows and columns of a count matrix.

Description

It is a good idea to subsample (each iteration) the genes and samples from a real RNA-seq dataset prior to applying thin_diff (and related functions) so that your conclusions are not dependent on the specific structure of your dataset. This function is designed to efficiently do this for you.

Usage

select_counts(
  mat,
  nsamp = ncol(mat),
  ngene = nrow(mat),
  gselect = c("random", "max", "mean_max", "custom"),
  gvec = NULL,
  filter_first = FALSE,
  nskip = 0L
)

Arguments

`mat`	A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples.
`nsamp`	The number of samples (columns) to select from `mat`.
`ngene`	The number of genes (rows) to select from `mat`.
`gselect`	How should we select the subset of genes? Options include: `random` Randomly select the genes, with each gene having an equal probability of being included in the subsampled matrix. `max` Choose the `ngene` most median-expressed genes. Ties are broken by mean-expression. `mean_max` Choose the `ngene` most mean-expressed genes. `custom` A user-specified list of genes. If `gselect = "custom"` then `gvec` needs to be non-`NULL`.
`gvec`	A logical vector of length `nrow(mat)`. A `TRUE` in position `i` indicates inclusion into the smaller dataset. Hence, `sum(gvec)` should equal `ngene`.
`filter_first`	Should we first filter genes by the method of Chen et al. (2016) (`TRUE`) or not (`FALSE`)? If `TRUE` then the `edgeR` package should be installed.
`nskip`	The number of median-maximally expressed genes to skip. Not used if `gselect = "custom"`.

Details

The samples (columns) are chosen randomly, with each sample having an equal probability of being in the sub-matrix. The genes are selected according to one of four schemes (see the description of the gselect argument).

If you have edgeR installed, then some functionality is provided for filtering out the lowest expressed genes prior to applying subsampling (see the filter_first argument). This filtering scheme is described in Chen et al. (2016). If you want more control over this filtering, you should use the filterByExpr function from edgeR directly. You can install edgeR by following instructions at doi:10.18129/B9.bioc.edgeR.

Value

A numeric matrix, which is a ngene by nsamp sub-matrix of mat. If rownames(mat) is NULL, then the row names of the returned matrix are the indices in mat of the selected genes. If colnames(mat) is NULL, then the column names of the returned matrix are the indices in mat of the selected samples.

Author(s)

David Gerard

References

Chen, Yunshun, Aaron TL Lun, and Gordon K. Smyth. "From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline." F1000Research 5 (2016). doi:10.12688/f1000research.8987.2.

Examples

## Simulate data from given matrix of counts
## In practice, you would obtain mat from a real dataset, not simulate it.
set.seed(1)
n   <- 100
p   <- 1000
mat <- matrix(stats::rpois(n * p, lambda = 50), nrow = p)

## Subsample the matrix, then feed it into a thinning function
submat <- select_counts(mat = mat, nsamp = 10, ngene = 100)
thout  <- thin_2group(mat = submat, prop_null = 0.5)

## The rownames and colnames (if NULL in mat) tell you which genes/samples
## were selected.
rownames(submat)
colnames(submat)

[Package seqgendiff version 1.2.4 Index]