ClustMMDD-package {ClustMMDD}    R Documentation

ClustMMDD: Clustering by Mixture Models for Discrete Data


ClustMMDD stands for "Clustering by Mixture Models for Discrete Data". This package addresses the two-fold problem of variable selection and model-based unsupervised classification in discrete settings. Both problems are solved simultaneously via a model selection procedure using penalized criteria: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Integrated Completed Log-likelihood (ICL), or a general criterion whose penalty function is calibrated from the data.


Package: ClustMMDD
Type: Package
Version: 1.0.1
Date: 2015-05-18
License: GPL (>= 2)

In this package, K and S are respectively the number of clusters and the subset of variables that are relevant for clustering purposes. We assume that a clustering variable has different probability distributions in at least two clusters, whereas a non-clustering variable has the same distribution in all clusters. We consider a general situation with data described by P random variables X^l, l=1,\cdots,P, where each variable X^l is an unordered set \left\{X^{l,1},\cdots,X^{l,ploidy}\right\} of ploidy categorical variables. For all l, the random variables X^{l,1},\cdots,X^{l,ploidy} take their values in the same set of levels. A typical example of such data comes from population genetics, where each genotype of a diploid individual consists of ploidy = 2 unordered alleles.
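The data model above can be illustrated with a small base-R simulation. This is only a sketch under made-up settings: the sample size, the allele frequencies in `freq`, and the helper `draw_locus` are hypothetical and are not part of the package.

```r
set.seed(1)

n      <- 200                    # number of individuals (hypothetical)
K      <- 2                      # number of clusters
ploidy <- 2                      # alleles per variable (diploid)
levels_per_locus <- c(3, 3, 4)   # P = 3 variables with 3, 3 and 4 levels
S      <- c(1, 2)                # clustering variables; variable 3 is not

# Allele frequencies: one (K x levels) matrix per variable.
freq <- list(
  rbind(c(0.7, 0.2, 0.1), c(0.1, 0.2, 0.7)),
  rbind(c(0.6, 0.3, 0.1), c(0.2, 0.2, 0.6)),
  rbind(rep(0.25, 4), rep(0.25, 4))   # same distribution in all clusters
)

# A variable is a clustering variable when its rows differ between clusters.
is_clustering <- sapply(freq, function(f) any(f[1, ] != f[2, ]))

z <- sample(K, n, replace = TRUE)     # latent cluster labels

# Draw `ploidy` alleles per variable and sort them, so that each genotype is
# stored as an unordered set: (1, 2) and (2, 1) give the same record.
draw_locus <- function(l) {
  t(sapply(z, function(k) {
    sort(sample(levels_per_locus[l], ploidy, replace = TRUE, prob = freq[[l]][k, ]))
  }))
}
genotypes <- do.call(cbind, lapply(seq_along(levels_per_locus), draw_locus))
colnames(genotypes) <- paste0(
  "X", rep(seq_along(levels_per_locus), each = ploidy),
  ".", rep(seq_len(ploidy), times = length(levels_per_locus))
)
head(genotypes)
```

Each row of `genotypes` is one individual, and each pair of adjacent columns holds the two alleles of one variable.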

The two-fold problem of clustering and variable selection is treated as a model selection problem. A specific collection of competing models associated with different values of (K, S) is defined, and these models are compared using penalized criteria. The penalized criteria are built from the maximum log-likelihood of a model and a penalty function.



The penalty functions used in this package depend on dim\left(K,S\right), the dimension (number of free parameters) of the model defined by \left(K,S\right).
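As a sketch of what such criteria typically look like, assume the common convention in which the maximum log-likelihood enters with a negative sign and the criterion is minimized. The symbols \ell_n (maximum log-likelihood), n (sample size) and \lambda (multiplicative constant to be calibrated) are notation introduced here; the package may use a different normalization, for instance dividing every term by n. ICL, which adds an entropy term, is left out.

```latex
\mathrm{crit}(K,S) = -\ell_n(K,S) + \mathrm{pen}(K,S)

\mathrm{pen}_{\mathrm{AIC}}(K,S) = \dim(K,S), \qquad
\mathrm{pen}_{\mathrm{BIC}}(K,S) = \frac{\log n}{2}\,\dim(K,S), \qquad
\mathrm{pen}_{\lambda}(K,S) = \lambda\,\dim(K,S)
```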

The maximum log-likelihood is estimated via the Expectation-Maximization (EM) algorithm. The maximum a posteriori classification is derived from the estimated parameters of the selected model.
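The EM step and the maximum a posteriori (MAP) classification can be sketched in base R for a simplified case. The sketch assumes ploidy = 1 (each variable is a single categorical variable) and treats every variable as a clustering variable; the function `em_latent_class` is hypothetical and is not the package's implementation.

```r
# x: integer matrix (n x P), entry x[i, l] in 1..M[l]; K: number of clusters.
em_latent_class <- function(x, K, n_iter = 200, tol = 1e-8) {
  n <- nrow(x); P <- ncol(x)
  M <- apply(x, 2, max)                    # number of levels of each variable
  prop <- rep(1 / K, K)                    # mixing proportions
  # theta[[l]]: K x M[l] matrix of level probabilities, randomly initialized.
  theta <- lapply(M, function(m) {
    p <- matrix(runif(K * m), K, m)
    p / rowSums(p)
  })
  loglik_old <- -Inf
  for (iter in seq_len(n_iter)) {
    # E-step: logp[i, k] = log(prop[k]) + sum_l log(theta[[l]][k, x[i, l]])
    logp <- matrix(log(prop), n, K, byrow = TRUE)
    for (l in seq_len(P)) {
      logp <- logp + t(log(theta[[l]][, x[, l], drop = FALSE]))
    }
    mx  <- apply(logp, 1, max)
    lse <- mx + log(rowSums(exp(logp - mx)))  # log-likelihood of each individual
    tau <- exp(logp - lse)                    # posterior membership probabilities
    loglik <- sum(lse)
    if (loglik - loglik_old < tol) break      # stop when the gain is negligible
    loglik_old <- loglik
    # M-step: weighted relative frequencies (a tiny floor avoids log(0)).
    prop <- colMeans(tau)
    for (l in seq_len(P)) {
      counts <- sapply(seq_len(M[l]), function(m) colSums(tau * (x[, l] == m)))
      counts <- matrix(counts, nrow = K) + 1e-10
      theta[[l]] <- counts / rowSums(counts)
    }
  }
  # MAP classification: most probable cluster under the last E-step posteriors.
  list(prop = prop, theta = theta, loglik = loglik, posterior = tau,
       map = max.col(tau, ties.method = "first"))
}

# Usage on simulated data with two well-separated clusters.
set.seed(42)
z <- rep(1:2, each = 100)                  # true labels
probs <- rbind(c(0.8, 0.1, 0.1), c(0.1, 0.1, 0.8))
x <- sapply(1:4, function(l) sapply(z, function(k) sample(3, 1, prob = probs[k, ])))
fit <- em_latent_class(x, K = 2)
table(truth = z, map = fit$map)
```

Cluster labels are identifiable only up to permutation, so the cross-table may show the two clusters swapped. EM converges to a local maximum, which is why several random starts are usually run in practice.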


Wilson Toussile

Maintainer: Wilson Toussile


See Also

The main functions:


Compute an approximation of the maximum likelihood estimates of the parameters using the Expectation-Maximization (EM) algorithm, for a given value of (K,S). The maximum a posteriori classification is then derived.


Gather the most competitive models using a backward-stepwise strategy.


Perform the data-driven calibration of the penalty function via an estimation of λ. Two candidate values are proposed, together with a plot to help the user choose between them.


Perform the selection of the number K of clusters for a given subset of clustering variables.


Perform a model selection from a collection of competing models.
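The calibration and selection steps listed above can be sketched in base R. This is a minimal illustration of the dimension-jump idea from slope heuristics, assuming a criterion of the form -loglik + cte * dim. The table `models` holds made-up values, and `dimension_jump` and `select_model` are hypothetical helpers, not the package's API; the package functions may proceed differently.

```r
# Hypothetical collection of explored models: number of clusters K, subset S
# (as a label), dimension and maximum log-likelihood. All values are made up.
models <- data.frame(
  K      = c(1, 2, 2, 3, 4),
  S      = c("none", "1 2", "1 2 3", "1 2 3", "1 2 3"),
  dim    = c(5, 10, 20, 40, 80),
  loglik = c(-1500, -1300, -1250, -1238, -1231)
)

# For each lambda on the grid, record the dimension of the model minimizing
# -loglik + lambda * dim, then locate the largest drop in that dimension.
dimension_jump <- function(models, lambda_grid) {
  selected_dim <- sapply(lambda_grid, function(lambda) {
    crit <- -models$loglik + lambda * models$dim
    models$dim[which.min(crit)]
  })
  jumps <- -diff(selected_dim)      # drop in dimension between grid points
  i <- which.max(jumps)
  list(lambda_hat = lambda_grid[i + 1], selected_dim = selected_dim)
}

# Select the model minimizing the penalized criterion for a given constant.
select_model <- function(models, cte) {
  crit <- -models$loglik + cte * models$dim
  models[which.min(crit), ]
}

out <- dimension_jump(models, lambda_grid = seq(0, 50, by = 0.05))
cte <- 2 * out$lambda_hat           # usual slope-heuristics rule of thumb

best   <- select_model(models, cte)                         # over all (K, S)
best_K <- select_model(subset(models, S == "1 2 3"), cte)   # K only, S fixed
best
best_K
```

Plotting `out$selected_dim` against the grid of λ values shows the jump visually, which is one way to check whether the automatic choice is reasonable.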



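library(ClustMMDD)
data(genotype2_ExploredModels)  # example data, assumed to ship with the package

# Calibration of the penalty function
outDimJump <- dimJump.R(genotype2_ExploredModels, N = 1000, h = 5, header = TRUE)
cte1 <- outDimJump[[1]][1]  # first element of the first component returned
# Model selection using the calibrated constant
outSelection <- model.selection.R(genotype2_ExploredModels, cte = cte1, header = TRUE)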

[Package ClustMMDD version 1.0.4 Index]