de_analysis {fastTopics} | R Documentation |
Differential Expression Analysis using a Topic Model
Description
Implements methods for differential expression analysis using a topic model. These methods are motivated by gene expression studies, but could have other uses, such as identifying “key words” for topics.
Usage
de_analysis(
fit,
X,
s = rowSums(X),
pseudocount = 0.01,
fit.method = c("scd", "em", "mu", "ccd", "glm"),
shrink.method = c("ash", "none"),
lfc.stat = "le",
control = list(),
verbose = TRUE,
...
)
de_analysis_control_default()
Arguments
fit |
An object of class “poisson_nmf_fit” or
“multinom_topic_model_fit”, or an n x k matrix of topic
proportions, where k is the number of topics. (The elements in each
row of this matrix should sum to 1.) If a Poisson NMF fit is
provided as input, the corresponding multinomial topic model fit is
automatically recovered using |
X |
The n x m counts matrix. It can be a sparse matrix (class
|
s |
A numeric vector of length n determining how the rates are
scaled in the Poisson models. See “Details” for guidance on
the choice of |
pseudocount |
Observations with this value are added to the counts matrix to stabilize maximum-likelihood estimation. |
fit.method |
Method used to fit the Poisson models. Note that
|
shrink.method |
Method used to stabilize the posterior mean
LFC estimates. When |
lfc.stat |
The log-fold change statistics returned:
|
control |
A list of parameters controlling behaviour of the optimization and Monte Carlo algorithms. See ‘Details’. |
verbose |
When |
... |
When |
Details
The methods are based on the Poisson model
x_i ~ Poisson(\lambda_i),
in which the Poisson rates are
\lambda_i = \sum_{j=1}^k s_i l_{ij} f_j,
the l_{ik}
are the topic proportions and the f_j
are the unknowns to be
estimated. This model is applied separately to each column of
X
. When s_i
(specified by input argument s
) is
equal the total count in row i (this is the default), the Poisson
model will closely approximate a binomial model of the count data,
and the unknowns f_j
will approximate binomial
probabilities. (The Poisson approximation to the binomial is most
accurate when the total counts rowSums(X)
are large and the
unknowns f_j
are small.) Other choices for s
are
possible, and implement different normalization schemes.
To allow for some flexibility, de_analysis
allows for the
log-fold change to be measured in several ways.
One option is to compare against the probability under the null
model: LFC(j) = log2(f_j/f_0)
, where f_0
is the single
parameter in the Poisson model x_i ~ Poisson(\lambda_i)
with
rates \lambda_i = s_i f_0
. This LFC definition is chosen with
lfc.stat = "vsnull"
.
Another option is to compare against a chosen topic, k: LFC(j)
= log2(f_j/f_k)
. By definition, LFC(k)
is zero, and
statistics such as z-scores and p-values for topic k are set to
NA
. This LFC definition is selected by setting
lfc.stat = k
.
A final option (which is the default) computes the “least
extreme” LFC, defined as LFC(j) = log2(f_j/f_k)
such that
k
is the topic other than j
that gives the ratio
f_j/f_k
closest to 1. This option is chosen with
lfc.stat = "le"
.
We recommend setting shrink.method = "ash"
, which uses the
“adaptive shrinkage” method (Stephens, 2016) to improve
accuracy of the posterior mean estimates and z-scores. We follow
the settings used in lfcShrink
from the ‘DESeq2’
package, with type = "ashr"
.
Note that all LFC statistics are defined using the base-2 logarithm following the conventioned used in differential expression analysis.
The control
argument is a list in which any of the
following named components will override the default optimization
algorithm settings (as they are defined by
de_analysis_control_default
):
numiter
Maximum number of iterations performed in fitting the Poisson models. When
fit.method = "glm"
, this is passed as argumentmaxit
to theglm
function.minval
A small, positive number. All topic proportions less than this value and greater than
1 - minval
are set to this value.tol
Controls the convergence tolerance for fitting the Poisson models. When
fit.method = "glm"
, this is passed as argumentepsilon
to functionglm
.conf.level
The size of the highest posterior density (HPD) intervals. Should be a number greater than 0 and less than 1.
ns
Number of Monte Carlo samples simulated by random-walk MCMC for estimating posterior LFC quantities.
rw
The standard deviation of the normal density used to propose new states in the random-walk MCMC.
eps
A small, non-negative number added to the terms inside the logarithms to avoid computing logarithms of zero.
nc
Number of threads used in the multithreaded computations. This controls both (a) the number of RcppParallel threads used to fit the factors in the Poisson models, and (b) the number of cores used in
mclapply
for the MCMC simulation step. Note that mclapply relies on forking hence is not available on Windows; will return an error on Windows unlessnc = 1
. Also note that settingnc > 1
copies the contents of memorync
times, which can lead to poor performance if the total resident memory required exceeds available physical memory.nc.blas
Number of threads used in the numerical linear algebra library (e.g., OpenBLAS), if available. For best performance, we recommend setting this to 1 (i.e., no multithreading).
nsplit
The number of data splits used in the multithreaded computations (only relevant when
nc > 1
). More splits increase the granularity of the progress bar, but can also slow down the mutithreaded computations by introducing more overhead in the call topblapply
.
Value
A list with the following elements:
est |
The log-fold change maximum-likelihood estimates. |
postmean |
Posterior mean LFC estimates. |
lower |
Lower limits of estimated HPD intervals. Note that these are not updated by the shrinkage step. |
upper |
Upper limits of estimated HPD intervals. Note that these are not updated by the shrinkage step. |
z |
z-scores for posterior mean LFC estimates. |
lpval |
-log10 two-tailed p-values obtained from the
z-scores. When |
svalue |
s-values returned by
|
lfsr |
When |
ash |
When |
F |
Maximum-likelihood estimates of the Poisson model parameters. |
f0 |
Maximum-likelihood estimates of the null model parameters. |
ar |
A vector containing the Metropolis acceptance ratios from each MCMC run. |
References
Stephens, M. (2016). False discovery rates: a new deal. Biostatistics 18(2), kxw041. doi:10.1093/biostatistics/kxw041
Zhu, A., Ibrahim, J. G. and Love, M. I. (2019). Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics 35(12), 2084–2092.
Examples
# Perform a differential expression (DE) analysis using the previously
# fitted multinomial topic model. Note that the de_analysis call could
# take several minutes to complete.
set.seed(1)
data(pbmc_facs)
de <- de_analysis(pbmc_facs$fit,pbmc_facs$counts)
# Compile the DE analysis results for topic 4 into a table, and
# rank the genes by the posterior mean log-fold change, limiting to
# DE genes identified with low lfsr ("local false sign rate").
k <- 4
dat <- data.frame(postmean = de$postmean[,k],
z = de$z[,k],
lfsr = de$lfsr[,k])
rownames(dat) <- with(pbmc_facs$genes,paste(symbol,ensembl,sep = "_"))
dat <- subset(dat,lfsr < 0.01)
dat <- dat[order(dat$postmean,decreasing = TRUE),]
# Genes at the top of this ranking are genes that are much more
# highly expressed in the topic compared to other topics.
head(dat,n = 10)
# The genes at the bottom of the ranking are genes that are much less
# expressed in the topic.
tail(dat,n = 10)
# Create a volcano plot from the DE results for topic 4.
volcano_plot(de,k = k,ymax = 50,labels = pbmc_facs$genes$symbol)