calcRDI {rdi} | R Documentation |
Calculate repertoire distances
Description
Calculate repertoire distances from a matrix of vdjCounts
Usage
calcRDI(vdjCounts, distMethod = c("euclidean", "cor"), subsample = TRUE,
nIter = 100, constScale = TRUE, units = c("lfc", "pct"), ...)
Arguments
vdjCounts |
a matrix of repertoire counts, as created by calcVDJCounts |
distMethod |
one of c("euclidean","cor") determining how to calculate the distance from the matrix of vdj counts. See Details. |
subsample |
logical; if true, all repertoires are subsampled to be equal size. |
nIter |
value defining how many times the subsampling should be repeated. Only used if subsample is TRUE. |
constScale |
logical; if |
units |
One of "lfc" or "pct". This determines the method used for transforming the repertoire counts. See Details. |
... |
additional parameters; these are ignored by this function. |
Details
There are two options for distance methods, "euclidean" and "cor". Euclidean refers to
standard euclidean distance, and is the standard for the RDI measure described in
(Bolen et al. Bioinformatics 2016). In contrast, cor refers to a correlation-based
distance metric, where the distance is defined as (1-correlation)
between each
column of vdjCounts.
The units
parameter is used to determine the transformation function for the
repertoire counts. If units='lfc'
(default), then the arcsinh transformation
is applied to the count matrix, resulting in a distance metric which
will scale with the average log fold change of each gene. In contrast,
units='pct'
will result in no transformation of the count matrix, and distances
will be proportional to the average percent change of each gene, instead. Note that
"units" is a bit of a misnomer, as the distance metric doesn't actually represent the
true log-fold or percent change in the repertoires. In order to actually estimate
these parameters, refer to the rdiModel and convertRDI
functions.
Value
A dissimilarity structure containing distances between repertoires, averaged across each subsampe run. In addition to the standard attributes in a dist object, three additional attributes are defined as follows:
ngenes | integers, the number of genes in each column of "genes" that were included in at least one repertoire. |
nseq | integer, the number of sequences used after subsampling
the repertoires. If subsample=FALSE , this is not
defined. |
units | string, either "lfc" or "pct", depending on the "units" in the original call |
Examples
#create genes
genes = sample(letters, 10000, replace=TRUE)
#create sequence annotations
seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE),
cellType = sample(c("B","T"), 10000, replace=TRUE)
)
##generate repertoire counts
cts = calcVDJcounts(genes,seqAnnot)
##calculate RDI
d = calcRDI(cts)
##calculate RDI in percent space
d_pct = calcRDI(cts,units="pct")
##convert RDI to actual 'lfc' estimates and compare
dtrue = convertRDI(d)$pred
plot(d, dtrue)