R: Conditional mixture modeling by EM algorithm

cmb.em {cmbClust}

R Documentation

Conditional mixture modeling by EM algorithm

Description

Fits a conditional mixture model and performs model-based clustering by the EM (expectation-maximization) algorithm for a prespecified conditioning order of the variables. Optionally runs a variable selection procedure (forward, backward, or stepwise) to obtain a parsimonious mixture model.

Usage

cmb.em(x, order = NULL, l, K, method = "stepwise", id0 = NULL, n.em = 200, em.iter = 5,
       EM.iter = 200, nk.min = NULL, max.spur = 5, tol = 1e-06, silent = FALSE,
       Parallel = FALSE, n.cores = 4)

Arguments

`x`	dataset matrix (n x p)
`order`	user-specified conditioning order of the variables (length p)
`l`	order of polynomial regression model
`K`	number of clusters
`method`	variable selection method (options 'stepwise', 'forward', 'backward' and 'none')
`id0`	initial membership vector (length n)
`n.em`	number of short EM runs in the emEM procedure
`em.iter`	maximum number of iterations of each short EM run in the emEM procedure
`EM.iter`	maximum number of EM iterations
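`nk.min`	spurious output control (default (l x (p - 1) + 1) x 2; see Details)
`max.spur`	maximum number of reruns when spurious output is obtained
`tol`	tolerance level used as a stopping criterion for the EM iterations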
`silent`	output control (TRUE/FALSE)
`Parallel`	parallel computing (TRUE/FALSE)
`n.cores`	number of cores in parallel computing

Details

In conditional mixture modeling, each component is modeled by a product of conditional distributions whose means are polynomial regression functions of other variables. The polynomial order l and the number of clusters K are prespecified by the user. The model can be initialized by passing a group membership vector to the argument id0; otherwise (the default) the initialization is obtained by the emEM algorithm within the function. Two arguments control the emEM procedure: n.em, the number of short EM runs, and em.iter, the maximum number of iterations of each short EM run. By default, n.em = 200 and em.iter = 5. The variable selection method is specified as method = "stepwise", "forward", "backward", or "none", where method = "none" means that no variable selection is carried out. During the model fitting and variable selection phases, the EM algorithm is applied multiple times; EM.iter and tol are the stopping criteria for each EM run. The spurious output control argument nk.min defaults to (l x (p - 1) + 1) x 2 and can be set by the user. When spurious output is obtained, cmb.em is rerun, at most max.spur times.
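As a worked example of the default nk.min formula stated above, the following base R sketch evaluates it for the iris setting used in the Examples section. The helper nk_min_default is hypothetical and is not part of cmbClust; it only reproduces the arithmetic.

```r
# Default spurious-output threshold as stated in Details: (l x (p - 1) + 1) x 2.
# nk_min_default is a hypothetical helper for illustration, not a cmbClust function.
nk_min_default <- function(l, p) {
  (l * (p - 1) + 1) * 2
}

x <- as.matrix(iris[, -5])          # same data as in the Examples section
p <- ncol(x)                        # number of variables (4)
l <- 2                              # order of the polynomial regression
nk_default <- nk_min_default(l, p)  # (2 * 3 + 1) * 2
print(nk_default)                   # 14
```

A value computed this way could be passed explicitly as nk.min if a different threshold than the default is wanted.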

Notation: n - sample size, p - number of variables, l - order of polynomial regression model, K - number of mixture components.
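The model described in Details can be sketched in formula form. The expression below is an illustration under stated assumptions, not the package's documented parameterization: it assumes Gaussian conditional densities, no cross-product terms between variables, and that x_(1), ..., x_(p) denote the variables arranged in the conditioning order. The symbols phi, mu_kj, and beta_kjim are introduced here for illustration only.

```latex
f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p}
  \phi\!\left(x_{(j)} \,\middle|\, \mu_{kj},\, \sigma^2_{kj}\right),
\qquad
\mu_{kj} = \beta_{kj0} + \sum_{i=1}^{j-1} \sum_{m=1}^{l} \beta_{kjim}\, x_{(i)}^{m}
```

Under these assumptions, variable j contributes 1 + (j - 1)l regression coefficients per component, which sums over j = 1, ..., p to p + p(p-1)l/2, the same count as the column dimension reported for beta in the Value section.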

Value

`data`	input dataset
`model`	estimated regression models for each cluster (K x p matrix)
`id`	vector of estimated membership (length n)
`loglik`	estimated log likelihood
`BIC`	Bayesian Information Criterion
`Pi`	vector of estimated mixing proportions (length K)
`tau`	matrix of estimated posterior probabilities (n x K)
`beta`	matrix of estimated regression parameters (K x (p + p(p-1)l/2))
`s2`	matrix of estimated variances (K x p)
`order`	applied conditioning order (length p)
`n_pars`	number of model parameters
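The dimensions listed above can be checked with a little arithmetic. The base R sketch below computes the column count of beta for the iris setting and shows how a posterior matrix such as tau can be turned into hard labels. The helper n_beta_cols and the toy tau values are hypothetical, and whether id is obtained from tau by exactly this maximum-posterior rule is an assumption rather than something stated in this page.

```r
# Hypothetical helper: number of columns of beta as given in the Value table,
# p + p(p-1)l/2.
n_beta_cols <- function(p, l) {
  p + p * (p - 1) * l / 2
}

beta_cols <- n_beta_cols(p = 4, l = 2)   # 4 + 4 * 3 * 2 / 2
print(beta_cols)                         # 16

# Toy posterior matrix (n = 3, K = 2); the values are made up for illustration.
tau <- matrix(c(0.9, 0.1,
                0.2, 0.8,
                0.6, 0.4), ncol = 2, byrow = TRUE)

# Hard assignment by maximum posterior probability (assumed convention).
id_hard <- apply(tau, 1, which.max)
print(id_hard)                           # 1 2 1
```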

References

Biernacki C., Celeux G., Govaert G. (2003). Choosing Starting Values for the EM Algorithm for Getting the Highest Likelihood in Multivariate Gaussian Mixture Models. Computational Statistics and Data Analysis, 41(3-4), pp. 561-575.

Examples

set.seed(1)
K <- 3
l <- 2
x <- as.matrix(iris[,-5])
id.true <- iris[,5]

# Run the EM algorithm to fit a conditional mixture model
obj <- cmb.em(x = x, order = c(1, 3, 2, 4), l = l, K = K, method = "stepwise",
              silent = FALSE, Parallel = FALSE)
id.cmb <- obj$id
table(id.true, id.cmb)
obj$BIC

[Package cmbClust version 0.0.1 Index]