Init {HTSCluster}    R Documentation
Parameter initialization for a Poisson mixture model.
Description
These functions implement a variety of initialization methods for the parameters of a Poisson mixture model: the small-EM initialization strategy (emInit) described in Rau et al. (2011); a K-means initialization strategy (kmeanInit), which is itself used to initialize the small-EM strategy; the splitting small-EM initialization strategy (splitEMInit), based on that described in Papastamoulis et al. (2014); and a function to initialize a small-EM strategy using the posterior probabilities (probaPostInit) obtained from a previous run with one fewer cluster, following the splitting strategy.
Usage
emInit(y, g, conds, norm, alg.type = "EM",
init.runs, init.iter, fixed.lambda, equal.proportions, verbose)
kmeanInit(y, g, conds, norm, fixed.lambda,
equal.proportions)
splitEMInit(y, g, conds, norm, alg.type, fixed.lambda,
equal.proportions, prev.labels, prev.probaPost, init.runs,
init.iter, verbose)
probaPostInit(y, g, conds, norm, alg.type = "EM",
fixed.lambda, equal.proportions, probaPost.init, init.iter,
verbose)
Arguments
y
    (n x q) matrix of observed counts for n observations and q variables.
g
    Number of clusters. If fixed.lambda contains a list of lambda values to be fixed, g corresponds to the number of clusters in addition to those fixed.
conds
    Vector of length q defining the condition (treatment group) for each variable (column) in y.
norm
    The type of estimator to be used to normalize for differences in library size: "TC" for total count, "UQ" for upper quantile, "Med" for median, "DESeq" for the normalization method of Anders and Huber (2010), and "TMM" for the trimmed mean of M-values method of Robinson and Oshlack (2010).
alg.type
    Algorithm to be used for parameter estimation: "EM" or "CEM" for the EM or CEM algorithm, respectively.
init.runs
    In the case of the small-EM algorithm, the number of independent runs to be performed. In the case of the splitting small-EM algorithm, the number of cluster splits to be performed in the splitting small-EM initialization.
init.iter
    The number of iterations to run within each small-EM algorithm.
fixed.lambda
    If one or more clusters with fixed values of lambda is desired, a list containing vectors of length d (the number of conditions). Note that the values of lambda chosen must satisfy the constraint noted in the technical report.
equal.proportions
    If TRUE, the cluster proportions are fixed to be equal.
prev.labels
    A vector of length n of cluster labels obtained from the previous run (g-1 clusters).
prev.probaPost
    An n x (g-1) matrix of the conditional probabilities of each observation belonging to each of the g-1 clusters from the previous run.
probaPost.init
    An n x g matrix of the conditional probabilities of each observation belonging to each of the g clusters, following the splitting strategy.
verbose
    If TRUE, verbose output is printed.
Details
In practice, the user will not directly call the initialization functions described here; they are called indirectly for a single number of clusters through the PoisMixClus function (via init.type), or for a sequence of cluster numbers through the PoisMixClusWrapper function (via gmin.init.type and split.init).
To initialize parameter values for the EM and CEM algorithms, the small-EM strategy (Biernacki et al., 2003) is implemented in the emInit function as follows. For each of a given number of independent runs (init.runs), the following procedure is used to obtain parameter values. First, a K-means algorithm (MacQueen, 1967) is run to partition the data into g clusters (\hat{\boldsymbol{z}}^{(0)}). Second, initial parameter values \boldsymbol{\pi}^{(0)} and \boldsymbol{\lambda}^{(0)} are calculated (see Rau et al. (2011) for details). Third, a given number of iterations of an EM algorithm (init.iter) are run, using \boldsymbol{\pi}^{(0)} and \boldsymbol{\lambda}^{(0)} as initial values. Finally, among the init.runs sets of parameter values, the \hat{\boldsymbol{\lambda}} and \hat{\boldsymbol{\pi}} corresponding to the highest log likelihood or completed log likelihood are used to initialize the subsequent full EM or CEM algorithm, respectively.
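The run-and-select procedure above can be illustrated outside of R. The following is a minimal sketch in Python (not HTSCluster code) of a small-EM search for a univariate Poisson mixture, using random starts in place of the K-means partition; parameter names mirror init.runs and init.iter:

```python
import math
import random

def poisson_logpmf(y, lam):
    # log P(Y = y) for a Poisson(lam) variable
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def em_step(y, pi, lam):
    """One EM iteration for a univariate Poisson mixture; returns updated
    (pi, lam) and the log likelihood at the current parameters."""
    n, g = len(y), len(pi)
    eps = 1e-12
    post = []
    loglik = 0.0
    for yi in y:
        # E-step: posterior probability of each cluster (log-sum-exp trick)
        logp = [math.log(pi[k] + eps) + poisson_logpmf(yi, lam[k]) for k in range(g)]
        m = max(logp)
        w = [math.exp(lp - m) for lp in logp]
        s = sum(w)
        loglik += m + math.log(s)
        post.append([wi / s for wi in w])
    # M-step: weighted proportions and means
    nk = [sum(post[i][k] for i in range(n)) for k in range(g)]
    pi_new = [nk[k] / n for k in range(g)]
    lam_new = [max(sum(post[i][k] * y[i] for i in range(n)) / max(nk[k], eps), eps)
               for k in range(g)]
    return pi_new, lam_new, loglik

def small_em(y, g, init_runs=5, init_iter=10, seed=1):
    """Run init_runs short EM chains and keep the parameter set with the
    highest log likelihood (the small-EM selection rule)."""
    rng = random.Random(seed)
    best = None
    for _ in range(init_runs):
        lam = sorted(rng.uniform(min(y) + 0.5, max(y) + 0.5) for _ in range(g))
        pi = [1.0 / g] * g
        loglik = -math.inf
        for _ in range(init_iter):
            pi, lam, loglik = em_step(y, pi, lam)
        if best is None or loglik > best[2]:
            best = (pi, lam, loglik)
    return best
```

The winning (pi, lam) pair would then seed the subsequent full EM or CEM run.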
For the splitting small EM initialization strategy, we implement an approach similar to that described in Papastamoulis et al. (2014), where the cluster from the previous run (with g-1 clusters) with the largest entropy is chosen to be split into two new clusters, followed by a small EM run as described above.
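As a rough sketch of the split-selection rule (an illustration, not HTSCluster code): given the posterior probability matrix from the previous run with g-1 clusters and the corresponding MAP labels, one plausible reading attributes each observation's posterior entropy to its MAP cluster and splits the cluster with the largest total:

```python
import math

def cluster_entropies(proba_post, labels):
    """Total posterior entropy attributed to each cluster via MAP labels.
    proba_post: n x g list of per-observation posterior probabilities;
    labels: MAP cluster index for each observation."""
    g = len(proba_post[0])
    ent = [0.0] * g
    for t_i, lab in zip(proba_post, labels):
        ent[lab] -= sum(t * math.log(t) for t in t_i if t > 0)
    return ent

def cluster_to_split(proba_post, labels):
    # The cluster with the largest entropy is the least well separated,
    # so it is the one chosen to be split into two new clusters.
    ent = cluster_entropies(proba_post, labels)
    return max(range(len(ent)), key=ent.__getitem__)
```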
Value
pi.init
    Vector of length g containing the estimate of pi corresponding to the highest log likelihood (or completed log likelihood) among the initialization runs.
lambda.init
    (d x g) matrix containing the estimate of lambda corresponding to the highest log likelihood (or completed log likelihood) among the initialization runs, where d is the number of conditions and g the number of clusters.
lambda
    (d x g) matrix of lambda values arising from the splitting initialization and small-EM run for a single split.
pi
    Vector of length g of pi values arising from the splitting initialization and small-EM run for a single split.
log.like
    Log likelihood arising from the splitting initialization and small-EM run for a single split.
Author(s)
Andrea Rau
References
Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 1-28.
Biernacki, C., Celeux, G., Govaert, G. (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics and Data Analysis, 41(1), 561-575.
MacQueen, J. B. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, number 1, pages 281-297. Berkeley, University of California Press.
Papastamoulis, P., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2014). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics and Data Analysis: 3rd special Issue on Advances in Mixture Models, DOI: 10.1016/j.csda.2014.07.005.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Robinson, M. D. and Oshlack, A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(R25).
See Also
PoisMixClus for Poisson mixture model estimation for a given number of clusters, and PoisMixClusWrapper for Poisson mixture model estimation and model selection over a sequence of cluster numbers.
Examples
set.seed(12345)
## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 500 observations
simulate <- PoisMixSim(n = 500, libsize = "A", separation = "high")
y <- simulate$y
conds <- simulate$conditions
## Calculate initial values for lambda and pi using the Small-EM
## initialization (4 classes, PMM-II model with "TC" library size)
##
## init.values <- emInit(y, g = 4, conds,
## norm = "TC", alg.type = "EM",
## init.runs = 50, init.iter = 10, fixed.lambda = NA,
## equal.proportions = FALSE, verbose = FALSE)
## pi.init <- init.values$pi.init
## lambda.init <- init.values$lambda.init