PoisMixClus {HTSCluster} | R Documentation |
Poisson mixture model estimation and model selection
Description
These functions implement the EM and CEM algorithms for parameter estimation in a Poisson mixture model for clustering high throughput sequencing observations (e.g., genes) for a single number of clusters (PoisMixClus
) or a sequence of cluster numbers (PoisMixClusWrapper
). Parameters are initialized using a Small-EM strategy as described in Rau et al. (2011) or the splitting small-EM strategy described in Papastamoulis et al. (2014), and model selection is performed using the ICL criteria. Note that these functions implement the PMM-I and PMM-II models described in Rau et al. (2011).
Usage
PoisMixClus(y, g, conds, norm = "TMM",
init.type = "small-em", init.runs = 1, init.iter = 10,
alg.type = "EM", cutoff = 10e-6, iter = 1000, fixed.lambda = NA,
equal.proportions = FALSE, prev.labels = NA,
prev.probaPost = NA, verbose = FALSE, interpretation = "sum",
EM.verbose = FALSE, wrapper = FALSE, subset.index = NA)
PoisMixClusWrapper(y, gmin = 1, gmax, conds,
norm = "TMM", gmin.init.type = "small-em",
init.runs = 1, init.iter = 10, split.init = TRUE, alg.type = "EM",
cutoff = 10e-6, iter = 1000, fixed.lambda = NA,
equal.proportions = FALSE, verbose = FALSE, interpretation = "sum",
EM.verbose = FALSE, subset.index = NA)
Arguments
y |
(n x q) matrix of observed counts for n observations and q variables |
g |
Number of clusters (a single value). If |
gmin |
The minimum number of clusters in a sequence to be tested. In cases where clusters are included with a fixed value
of lambda, |
gmax |
The maximum number of clusters in a sequence to be tested. In cases where clusters are included with a fixed value
of lambda, |
conds |
Vector of length q defining the condition (treatment group) for each variable (column) in |
norm |
The type of estimator to be used to normalize for differences in library size: (“ |
init.type |
Type of initialization strategy to be used (“ |
gmin.init.type |
Type of initialization strategy to be used for the minimum number of clusters in a sequence ( |
init.runs |
Number of runs to be used for the Small-EM strategy described in Rau et al. (2011), with a default value of 1 |
init.iter |
Number of iterations to be used within each run for the Small-EM strategry, with a default value of 10 |
split.init |
If |
alg.type |
Algorithm to be used for parameter estimation (“ |
cutoff |
Cutoff to declare algorithm convergence (in terms of differences in log likelihoods from one iteration to the next) |
iter |
Maximum number of iterations to be run for the chosen algorithm |
fixed.lambda |
If one (or more) clusters with fixed values of lambda is desired, a list containing vectors of length d (the number of conditions). specifying the fixed values of lambda for each fixed cluster. |
equal.proportions |
If |
prev.labels |
A vector of length n of cluster labels obtained from the previous run (g-1 clusters) to be used with the splitting small-EM strategy described in described in Papastamoulis et al. (2014). For other initialization strategies, this parameter takes the value NA |
prev.probaPost |
An n x (g-1) matrix of the conditional probabilities of each observation belonging to each of the g-1 clusters from the previous run, to be used with the splitting small-EM strategy of described in Papastamoulis et al. (2012). For other initialization strategies, this parameter takes the value NA |
verbose |
If |
interpretation |
If |
EM.verbose |
If |
subset.index |
Optional vector providing the indices of a subset of genes that should be used for the co-expression analysis (i.e., row indices
of the data matrix |
wrapper |
|
Details
Output of PoisMixClus
is an S3 object of class HTSCluster
, and output of PoisMixClusWrapper
is an S3 object
of class HTSClusterWrapper
.
In a Poisson mixture model, the data \mathbf{y}
are assumed to come from g distinct subpopulations (clusters), each of which is modeled separately; the overall population is thus a mixture of these subpopulations. In the case of a Poisson mixture model with g components, the model may be written as
f(\mathbf{y};g,\boldsymbol{\Psi}_g) = \prod_{i=1}^n \sum_{k=1}^g \pi_k \prod_{j=1}^{d}\prod_{l=1}^{r_j} P(y_{ijl} ; \boldsymbol{\theta}_k)
for i = 1, \ldots, n
observations in l = 1, \ldots, r_j
replicates of j = 1, \ldots, d
conditions (treatment groups), where P(\cdot)
is the standard Poisson density, \boldsymbol{\Psi}_g = (\pi_1,\ldots,\pi_{g-1}, \boldsymbol{\theta}^\prime)
, \boldsymbol{\theta}^\prime
contains all of the parameters in \boldsymbol{\theta}_1,\ldots,\boldsymbol{\theta}_g
assumed to be distinct, and \boldsymbol{\pi} = (\pi_1,\ldots,\pi_g)^\prime
are the mixing proportions such that \pi_k
is in (0,1) for all k and \sum_k \pi_k = 1
.
We consider the following parameterization for the mean \boldsymbol{\theta}_k = (\mu_{ijlk})
. We consider
\mu_{ijlk} = w_i s_{jl} \lambda_{jk}
where w_i
corresponds to the expression level of observation i, \boldsymbol{\lambda}_k = (\lambda_{1k},\ldots,\lambda_{dk})
corresponds to the clustering parameters that define the profiles of the genes in cluster k across all variables, and
s_{jl}
is the normalized library size (a fixed constant) for replicate l of condition j.
There are two approaches to estimating the parameters of a finite mixture model and obtaining a clustering of the data: the estimation approach (via the EM algorithm) and the clustering approach (via the CEM algorithm). Parameter initialization is done using a Small-EM strategy as described in Rau et al. (2011) via the emInit
function. Model selection may be performed using the BIC or ICL criteria, or the slope heuristics.
Value
lambda |
(d x g) matrix containing the estimate of |
pi |
Vector of length g containing the estimate of |
labels |
Vector of length n containing the cluster assignments of the n observations |
probaPost |
Matrix containing the conditional probabilities of belonging to each cluster for all observations |
log.like |
Value of log likelihood |
BIC |
Value of BIC criterion |
ICL |
Value of ICL criterion |
alg.type |
Estimation algorithm used; matches the argument |
norm |
Library size normalization factors used |
conds |
Conditions specified by user |
iterations |
Number of iterations run |
logLikeDiff |
Difference in log-likelihood between the last and penultimate iterations of the algorithm |
subset.index |
If provided by the user, the indices of subset of genes used for co-expression analyses |
loglike.all |
Log likelihoods calculated for each of the fitted models for cluster sizes |
capushe |
Results of capushe model selection, an object of class |
ICL.all |
ICL values calculated for each of the fitted models for cluster sizes |
ICL.results |
Object of class |
BIC.results |
Object of class |
DDSE.results |
Object of class |
Djump.results |
Object of class |
all.results |
List of objects of class |
model.selection |
Type of criteria used for model selection, equal to |
Note
Note that the fixed.lambda
argument is primarily intended to be used in the case when a single cluster is fixed to
have equal clustering parameters lambda across all conditions (i.e., \lambda_{j1}=\lambda_{1}=1
); this is particularly useful
when identifying genes with non-differential expression across all conditions (see the HTSDiff
R package for more details).
Alternatively, this argument could be used to specify a cluster for which genes are only expressed in a single condition
(e.g., \lambda_{11} = 1
and \lambda_{j1} = 0
for all j > 1
). Other possibilities could be considered,
but note that the fixed values of lambda must satisfy the constraint \sum_j \lambda_{jk}s_{j.} = 1
for all k
imposed in the model; if this is not the case, a warning message will be printed.
Author(s)
Andrea Rau
References
Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 1-28.
Papastamoulis, P., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2014). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics and Data Analysis: 3rd special Issue on Advances in Mixture Models, DOI: 10.1016/j.csda.2014.07.005.
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
See Also
probaPost
for the calculation of the conditional probability of belonging to a cluster;
PoisMixMean
for the calculation of the per-cluster conditional mean of each observation;
logLikePoisMixDiff
for the calculation of the log likelihood of a Poisson mixture model;
emInit
and kmeanInit
for the Small-EM parameter initialization strategy
Examples
set.seed(12345)
## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 200 observations
simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high")
y <- simulate$y
conds <- simulate$conditions
## Run the PMM model for g = 3
## "TC" library size estimate, EM algorithm
run <- PoisMixClus(y, g = 3, conds = conds, norm = "TC")
## Estimates of pi and lambda for the selected model
pi.est <- run$pi
lambda.est <- run$lambda
## Not run: PMM for 4 total clusters, with one fixed class
## "TC" library size estimate, EM algorithm
##
## run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds,
## fixed.lambda = list(c(1,1,1)))
##
##
## Not run: PMM model for 4 clusters, with equal proportions
## "TC" library size estimate, EM algorithm
##
## run <- PoisMixClus(y, g = 4, norm = "TC", conds = conds,
## equal.proportions = TRUE)
##
##
## Not run: PMM model for g = 1, ..., 10 clusters, Split Small-EM init
##
## run1.10 <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds,
## norm = "TC")
##
##
## Not run: PMM model for g = 1, ..., 10 clusters, Small-EM init
##
## run1.10bis <- <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds,
## norm = "TC", split.init = FALSE)
##
##
## Not run: previous model equivalent to the following
##
## for(K in 1:10) {
## run <- PoisMixClus(y, g = K, conds = conds, norm = "TC")
## }