cmb.em {cmbClust}    R Documentation

Conditional mixture modeling by EM algorithm

Description

Performs conditional mixture modeling and model-based clustering with the EM (expectation-maximization) algorithm for a prespecified conditioning order of the variables. Optionally runs a variable selection procedure (forward, backward, or stepwise) to obtain a parsimonious mixture model.

Usage

cmb.em(x, order = NULL, l, K, method = "stepwise", id0 = NULL, n.em = 200, em.iter = 5,
EM.iter = 200, nk.min = NULL, max.spur = 5, tol = 1e-06, silent = FALSE, Parallel = FALSE,
n.cores = 4)

Arguments

x

dataset matrix (n x p)

order

custom conditioning order of the variables (vector of length p)

l

order of polynomial regression model

K

number of clusters

method

variable selection method (options 'stepwise', 'forward', 'backward' and 'none')

id0

initial membership vector (length n)

n.em

number of short EM runs in the emEM procedure

em.iter

maximum number of iterations of each short EM run in the emEM procedure

EM.iter

maximum number of EM iterations

nk.min

spurious output control (default (l x (p - 1) + 1) x 2)

max.spur

maximum number of reruns when spurious output is obtained

tol

tolerance level used as a stopping criterion for the EM iterations

silent

output control (TRUE/FALSE)

Parallel

parallel computing (TRUE/FALSE)

n.cores

number of cores in parallel computing

Details

In conditional mixture modeling, each component is modeled by a product of conditional distributions whose means are polynomial regression functions of other variables. The polynomial order l and the number of clusters K are prespecified by the user.

The model can be initialized by passing a group membership vector to the argument id0; otherwise the initialization is obtained by the emEM algorithm (the default setting). Two arguments control the emEM procedure: the number of short EM runs, n.em, and the maximum number of iterations for each short EM run, em.iter. By default, n.em = 200 and em.iter = 5.

The variable selection method can be specified as method = "stepwise", "forward", "backward", or "none", where method = "none" means that no variable selection is carried out. During the model fitting and variable selection phases, the EM algorithm is applied multiple times; the options EM.iter and tol are the stopping criteria for the EM iterations.

The spurious output control argument nk.min can be set by the user; by default, nk.min = (l x (p - 1) + 1) x 2. When spurious output is obtained, cmb.em is rerun, up to a maximum of max.spur times.
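
As a quick check of the default spurious output threshold, the formula for nk.min can be evaluated directly. The sketch below is illustrative only: the helper name default_nk_min is hypothetical and not part of cmbClust; it merely reproduces the stated formula for a dataset with p variables.

```r
# Hypothetical helper (not part of cmbClust): evaluates the documented
# default nk.min = (l x (p - 1) + 1) x 2.
default_nk_min <- function(l, p) {
  (l * (p - 1) + 1) * 2
}

# Example: four variables (as in iris[, -5]) and quadratic regressions
nk_example <- default_nk_min(l = 2, p = 4)
print(nk_example)  # 14
```

With l = 2 and p = 4 this gives (2 x 3 + 1) x 2 = 14.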

Notation: n - sample size, p - number of variables, l - order of polynomial regression model, K - number of mixture components.
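
For orientation, the model described in this section can be written out as follows. This is a sketch under assumptions, not a formula taken from the package: it assumes Gaussian conditional distributions, variables indexed in the conditioning order, and polynomial means without cross-product terms. That form is chosen because it matches the documented dimension of beta, p + p(p-1)l/2 columns per component. The symbols \pi_k, \beta, and \sigma^2 are notation introduced here and are meant to correspond to Pi, beta, and s2 in the returned object.

```latex
f(x_1,\dots,x_p) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p}
  \phi\!\left(x_j \,;\, \mu_{kj}(x_1,\dots,x_{j-1}),\, \sigma_{kj}^2\right),
\qquad
\mu_{kj}(x_1,\dots,x_{j-1}) = \beta_{kj0} + \sum_{m=1}^{j-1} \sum_{r=1}^{l} \beta_{kjmr}\, x_m^{r}
```

Under this form each component has p intercepts and, for the j-th variable in the order, (j - 1) x l slope terms, which sums to p + p(p-1)l/2 regression parameters.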

Value

data

input dataset

model

estimated regression models for each cluster (K x p matrix)

id

vector of estimated membership (length n)

loglik

estimated log likelihood

BIC

Bayesian Information Criterion

Pi

vector of estimated mixing proportions (length K)

tau

matrix of estimated posterior probabilities (n x K)

beta

matrix of estimated regression parameters (K x (p + p(p-1)l/2))

s2

matrix of estimated variances (K x p)

order

applied conditioning order (length p)

n_pars

number of model parameters
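
The documented dimensions can be checked by hand before fitting. The following sketch computes the expected shapes of Pi, tau, beta, and s2 for given n, p, l, and K; the helper name expected_dims is hypothetical and not part of cmbClust.

```r
# Hypothetical helper (not part of cmbClust): expected shapes of selected
# components of the returned object, taken from the dimensions listed above.
expected_dims <- function(n, p, l, K) {
  list(
    Pi   = K,                              # mixing proportions (length K)
    tau  = c(n, K),                        # posterior probabilities
    beta = c(K, p + p * (p - 1) * l / 2),  # regression parameters
    s2   = c(K, p)                         # variances
  )
}

# Example: iris[, -5] has n = 150 and p = 4; take l = 2 and K = 3
dims <- expected_dims(n = 150, p = 4, l = 2, K = 3)
print(dims$beta)  # 3 16
```

For p = 4 and l = 2, beta is expected to have 4 + 4 x 3 x 2 / 2 = 16 columns.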

References

Biernacki C., Celeux G., Govaert G. (2003). Choosing Starting Values for the EM Algorithm for Getting the Highest Likelihood in Multivariate Gaussian Mixture Models. Computational Statistics and Data Analysis, 41(3-4), pp. 561-575.

Examples

set.seed(1)
K <- 3
l <- 2
x <- as.matrix(iris[,-5])
id.true <- iris[,5]

# Run the EM algorithm to fit a conditional mixture model
obj <- cmb.em(x = x, order = c(1,3,2,4), l = l, K = K, method = "stepwise", silent = FALSE,
              Parallel = FALSE)
id.cmb <- obj$id
table(id.true, id.cmb)
obj$BIC
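
The initialization can also be supplied through id0 instead of relying on emEM. The sketch below builds a starting partition with stats::kmeans and passes it to cmb.em. It assumes that id0 accepts integer labels in 1..K, which this page does not state explicitly, and the call is guarded so that the snippet still runs when cmbClust is not installed.

```r
set.seed(1)
K <- 3
l <- 2
x <- as.matrix(iris[, -5])

# Starting partition from k-means (assumption: id0 takes labels in 1..K)
id0 <- kmeans(x, centers = K, nstart = 10)$cluster

if (requireNamespace("cmbClust", quietly = TRUE)) {
  # Fit without variable selection, starting from the k-means partition
  obj0 <- cmbClust::cmb.em(x = x, l = l, K = K, method = "none",
                           id0 = id0, silent = TRUE)
  # Compare the starting partition with the fitted memberships
  print(table(id0, obj0$id))
}
```

Using method = "none" here keeps the run short; any of the other selection methods could be substituted.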


[Package cmbClust version 0.0.1 Index]