calcVDJcounts {rdi} | R Documentation |
Calculate repertoire counts
Description
Create count matrices for a set of repertoires
Usage
calcVDJcounts(genes, seqAnnot, select = NULL, combine = NULL,
vdjDrop = NULL, splitBy = NULL, simplifyNames = TRUE,
splitCommas = FALSE, ...)
Arguments
genes |
vector (or matrix) containing the gene calls for each sequence. If genes is a matrix, counts will be calculated for each column of 'genes', and the resulting count matrices will be concatenated. See Details. |
seqAnnot |
matrix containing repertoire annotations. |
select |
a list containing definitions of which repertoires to use. See Details. |
combine |
a list defining repertoires that should be grouped together. See Details. |
vdjDrop |
a list specifying specific genes to exclude from analysis. See Details. |
splitBy |
the columns in seqAnnot to use for splitting repertoires. Default is to use all columns in seqAnnot. |
simplifyNames |
logical; if true, any columns of seqAnnot where all selected (and collapsed) sequences share the same value will not be used to make the names of sequenceCodes. |
splitCommas |
logical; if true, seqAnnot is assumed to contain. comma-separated lists of sequence annotations, which will be split up before generating sequence codes. Note: setting this to TRUE will make processing much slower. |
... |
additional parameters; these are ignored by this function. |
Details
In most cases, genes
will be a single vector or one-column matrix. However,
there are some cases where a row of seqAnnot
corresponds to two (or more) genes
(e.g. the V and J gene segments of a single immune sequence). Rather than make multiple
rows for each gene, the calcVDJcounts
function provides the option to provide
a multi-column matrix for genes
. The counts for each column will be tallied
separately, and are then concatenated.
To ensure equal variance across all repertoires, the default RDI metric uses
subsampling to ensure that all repertoires have the same number of sequences. The
default RDI metric subsamples all repertoires to the size of the smallest repertoire,
which may result in a loss of power for comparisons between larger repertoires.
In order to increase power for various tests, it is often useful to only calculate
the repertoire counts for a subset of the repertoires in seqAnnot. This can be done by
using the select
and combine
parameters to specify which
repertoires to include in the analysis.
Both parameters are lists containing entries
with the same name as one of the columns of seqAnnot. For select
, each entry is
a vector defining which values to include (e.g., to include only Visit 1 and 3, you
might specify select=list(visit=c("V1","V3"))
, where the 'visit'
column
in seqAnnot contains the values "V1"
,"V2"
, and "V3"
). In this
case, any rows of genes
and seqAnnot
that come from a repertoire not
specified in select
will be discarded. By default, if a select
code is
not specified for a column in seqAnnot
, all values from that column will be
included.
The combine
parameter works in a similar fashion, but instead of a vector
describing which parameters to include, you can specify a vector of regular
expressions, and any values of the seqAnnot
column that match the regular
expression will be combined into a single repertoire (e.g. to combine visits 1 and 3
into a single repertoire, you might specify combine=list(visit="V[13]")
).
The vdjDrop
parameter is also useful for limiting sequences. Like
select
and combine
, this is a named list, with entries corresponding to
the columns of genes
. Each entry of vdjDrop
is a vector of gene segment
names to remove from the analysis. All sequences containing those genes are removed
from the analysis before subsampling.
Once unwanted rows have been removed, the columns of seqAnnot
are concatenated
to generate "repertoire" labels for each row. The repertoire labels are then used
to split the rows of genes
, and gene prevalence is tallied within a repertoire.
By default, columns of seqAnnot
that are constant after subsetting will not be
included in the label. However, this can be controlled by the simplifyNames
parameter. If simplifyNames
is FALSE, all columns of seqAnnot
are
included when generating labels.
Value
A matrix where each row represents a gene, and each column represents a repertoire.
Examples
#create genes
genes = sample(letters, 10000, replace=TRUE)
#create sequence annotations
seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE),
visit = sample(c("V1","V2","V3"), 10000, replace=TRUE),
cellType = sample(c("B","T"), 10000, replace=TRUE)
)
##generate repertoire counts for all repertoires
cts = calcVDJcounts(genes,seqAnnot)
##Only include visit 1
cts = calcVDJcounts(genes,seqAnnot, select=list(visit="V1"))
## Just T cell repertoires, combining visit 1 and 3 together, and dropping visit 2
cts = calcVDJcounts(genes,seqAnnot,
select=list(cellType="T", visit=c("V1","V3")),
combine=list(visit="V[13]"))