| de_analysis {fastTopics} | R Documentation |
Differential Expression Analysis using a Topic Model
Description
Implements methods for differential expression analysis using a topic model. These methods are motivated by gene expression studies, but could have other uses, such as identifying “key words” for topics.
Usage
de_analysis(
fit,
X,
s = rowSums(X),
pseudocount = 0.01,
fit.method = c("scd", "em", "mu", "ccd", "glm"),
shrink.method = c("ash", "none"),
lfc.stat = "le",
control = list(),
verbose = TRUE,
...
)
de_analysis_control_default()
Arguments
fit |
An object of class “poisson_nmf_fit” or
“multinom_topic_model_fit”, or an n x k matrix of topic
proportions, where k is the number of topics. (The elements in each
row of this matrix should sum to 1.) If a Poisson NMF fit is
provided as input, the corresponding multinomial topic model fit is
automatically recovered using |
X |
The n x m counts matrix. It can be a sparse matrix (class
|
s |
A numeric vector of length n determining how the rates are
scaled in the Poisson models. See “Details” for guidance on
the choice of |
pseudocount |
Observations with this value are added to the counts matrix to stabilize maximum-likelihood estimation. |
fit.method |
Method used to fit the Poisson models. Note that
|
shrink.method |
Method used to stabilize the posterior mean
LFC estimates. When |
lfc.stat |
The log-fold change statistics returned:
|
control |
A list of parameters controlling behaviour of the optimization and Monte Carlo algorithms. See ‘Details’. |
verbose |
When |
... |
When |
Details
The methods are based on the Poisson model
x_i ~ Poisson(\lambda_i),
in which the Poisson rates are
\lambda_i = \sum_{j=1}^k s_i l_{ij} f_j,
the l_{ik}
are the topic proportions and the f_j are the unknowns to be
estimated. This model is applied separately to each column of
X. When s_i (specified by input argument s) is
equal the total count in row i (this is the default), the Poisson
model will closely approximate a binomial model of the count data,
and the unknowns f_j will approximate binomial
probabilities. (The Poisson approximation to the binomial is most
accurate when the total counts rowSums(X) are large and the
unknowns f_j are small.) Other choices for s are
possible, and implement different normalization schemes.
To allow for some flexibility, de_analysis allows for the
log-fold change to be measured in several ways.
One option is to compare against the probability under the null
model: LFC(j) = log2(f_j/f_0), where f_0 is the single
parameter in the Poisson model x_i ~ Poisson(\lambda_i) with
rates \lambda_i = s_i f_0. This LFC definition is chosen with
lfc.stat = "vsnull".
Another option is to compare against a chosen topic, k: LFC(j)
= log2(f_j/f_k). By definition, LFC(k) is zero, and
statistics such as z-scores and p-values for topic k are set to
NA. This LFC definition is selected by setting
lfc.stat = k.
A final option (which is the default) computes the “least
extreme” LFC, defined as LFC(j) = log2(f_j/f_k) such that
k is the topic other than j that gives the ratio
f_j/f_k closest to 1. This option is chosen with
lfc.stat = "le".
We recommend setting shrink.method = "ash", which uses the
“adaptive shrinkage” method (Stephens, 2016) to improve
accuracy of the posterior mean estimates and z-scores. We follow
the settings used in lfcShrink from the ‘DESeq2’
package, with type = "ashr".
Note that all LFC statistics are defined using the base-2 logarithm following the conventioned used in differential expression analysis.
The control argument is a list in which any of the
following named components will override the default optimization
algorithm settings (as they are defined by
de_analysis_control_default):
numiterMaximum number of iterations performed in fitting the Poisson models. When
fit.method = "glm", this is passed as argumentmaxitto theglmfunction.minvalA small, positive number. All topic proportions less than this value and greater than
1 - minvalare set to this value.tolControls the convergence tolerance for fitting the Poisson models. When
fit.method = "glm", this is passed as argumentepsilonto functionglm.conf.levelThe size of the highest posterior density (HPD) intervals. Should be a number greater than 0 and less than 1.
nsNumber of Monte Carlo samples simulated by random-walk MCMC for estimating posterior LFC quantities.
rwThe standard deviation of the normal density used to propose new states in the random-walk MCMC.
epsA small, non-negative number added to the terms inside the logarithms to avoid computing logarithms of zero.
ncNumber of threads used in the multithreaded computations. This controls both (a) the number of RcppParallel threads used to fit the factors in the Poisson models, and (b) the number of cores used in
mclapplyfor the MCMC simulation step. Note that mclapply relies on forking hence is not available on Windows; will return an error on Windows unlessnc = 1. Also note that settingnc > 1copies the contents of memorynctimes, which can lead to poor performance if the total resident memory required exceeds available physical memory.nc.blasNumber of threads used in the numerical linear algebra library (e.g., OpenBLAS), if available. For best performance, we recommend setting this to 1 (i.e., no multithreading).
nsplitThe number of data splits used in the multithreaded computations (only relevant when
nc > 1). More splits increase the granularity of the progress bar, but can also slow down the mutithreaded computations by introducing more overhead in the call topblapply.
Value
A list with the following elements:
est |
The log-fold change maximum-likelihood estimates. |
postmean |
Posterior mean LFC estimates. |
lower |
Lower limits of estimated HPD intervals. Note that these are not updated by the shrinkage step. |
upper |
Upper limits of estimated HPD intervals. Note that these are not updated by the shrinkage step. |
z |
z-scores for posterior mean LFC estimates. |
lpval |
-log10 two-tailed p-values obtained from the
z-scores. When |
svalue |
s-values returned by
|
lfsr |
When |
ash |
When |
F |
Maximum-likelihood estimates of the Poisson model parameters. |
f0 |
Maximum-likelihood estimates of the null model parameters. |
ar |
A vector containing the Metropolis acceptance ratios from each MCMC run. |
References
Stephens, M. (2016). False discovery rates: a new deal. Biostatistics 18(2), kxw041. doi:10.1093/biostatistics/kxw041
Zhu, A., Ibrahim, J. G. and Love, M. I. (2019). Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics 35(12), 2084–2092.
Examples
# Perform a differential expression (DE) analysis using the previously
# fitted multinomial topic model. Note that the de_analysis call could
# take several minutes to complete.
set.seed(1)
data(pbmc_facs)
de <- de_analysis(pbmc_facs$fit,pbmc_facs$counts)
# Compile the DE analysis results for topic 4 into a table, and
# rank the genes by the posterior mean log-fold change, limiting to
# DE genes identified with low lfsr ("local false sign rate").
k <- 4
dat <- data.frame(postmean = de$postmean[,k],
z = de$z[,k],
lfsr = de$lfsr[,k])
rownames(dat) <- with(pbmc_facs$genes,paste(symbol,ensembl,sep = "_"))
dat <- subset(dat,lfsr < 0.01)
dat <- dat[order(dat$postmean,decreasing = TRUE),]
# Genes at the top of this ranking are genes that are much more
# highly expressed in the topic compared to other topics.
head(dat,n = 10)
# The genes at the bottom of the ranking are genes that are much less
# expressed in the topic.
tail(dat,n = 10)
# Create a volcano plot from the DE results for topic 4.
volcano_plot(de,k = k,ymax = 50,labels = pbmc_facs$genes$symbol)