ClustMMDD-package {ClustMMDD}    R Documentation

ClustMMDD: Clustering by Mixture Models for Discrete Data

Description

ClustMMDD stands for "Clustering by Mixture Models for Discrete Data". This package addresses the two-fold problem of variable selection and model-based unsupervised classification for discrete data. Variable selection and classification are solved simultaneously via a model selection procedure using penalized criteria: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Integrated Completed Log-likelihood (ICL), or a general criterion whose penalty function is calibrated from the data.

Details

Package: ClustMMDD
Type: Package
Version: 1.0.1
Date: 2015-05-18
License: GPL (>= 2)

In this package, K and S are respectively the number of clusters and the subset of variables that are relevant for clustering purposes. We assume that a clustering variable has different probability distributions in at least two clusters, and that a non-clustering variable has the same distribution in all clusters. We consider a general situation in which the data are described by P random variables X^l, l=1,\cdots,P, where each variable X^l is an unordered set \left\{X^{l,1},\cdots,X^{l,ploidy}\right\} of ploidy categorical variables. For all l, the random variables X^{l,1},\cdots,X^{l,ploidy} take their values in the same set of levels. A typical example of such data comes from population genetics, where each genotype of a diploid individual consists of ploidy = 2 unordered alleles.
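As a minimal sketch of this data layout (not taken from the package), the base R snippet below stores one column per allele copy and maps each pair of alleles to a canonical, order-free genotype label. The helper name canonical_genotype, the column names and the toy values are all hypothetical and chosen only for illustration.

```r
# Hypothetical helper: sort the ploidy alleles of one observation so that
# unordered sets such as {A, B} and {B, A} get the same label.
canonical_genotype <- function(alleles) {
  paste(sort(as.character(alleles)), collapse = "/")
}

# Two diploid individuals at one locus; only the allele order differs.
ind1 <- c("A", "B")
ind2 <- c("B", "A")
same_genotype <- canonical_genotype(ind1) == canonical_genotype(ind2)

# Made-up data: 3 individuals, P = 2 variables, ploidy = 2,
# stored with one column per allele copy.
toy <- data.frame(
  L1.1 = c("A", "B", "A"), L1.2 = c("B", "A", "A"),
  L2.1 = c("C", "C", "D"), L2.2 = c("D", "C", "C"),
  stringsAsFactors = FALSE
)

# Order-free genotype of each individual at the first variable.
geno_L1 <- apply(toy[, c("L1.1", "L1.2")], 1, canonical_genotype)
table(geno_L1)
```

Sorting the alleles is only one way to encode an unordered set; any canonical ordering of the levels would serve the same purpose.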

The two-fold problem of clustering and variable selection is treated as a model selection problem. A specific collection of competing models, each associated with a different value of (K, S), is defined, and the models are compared using penalized criteria. The penalized criteria are of the form

crit(K,S) = γ_n(K,S) + pen(K,S),

where

The penalty functions used in this package are the following, where dim(K,S) is the dimension (number of free parameters) of the model defined by (K,S):

The maximum log-likelihood is estimated via the Expectation-Maximization (EM) algorithm. The maximum a posteriori (MAP) classification is derived from the estimated parameters of the selected model.
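To illustrate the EM and MAP steps, here is a simplified base R sketch for a mixture of independent categorical variables. It is not the package's implementation: it assumes ploidy = 1, treats every variable as a clustering variable, and expects levels coded as integers 1..M. The function name em_categorical, its arguments and the simulated data are hypothetical.

```r
# Minimal EM sketch for a mixture of independent categorical variables.
# Assumptions: ploidy = 1, all variables are clustering variables,
# x is an n x P integer matrix with levels coded 1..M.
em_categorical <- function(x, K, M, n_iter = 100, seed = 1) {
  set.seed(seed)
  n <- nrow(x); P <- ncol(x)
  prop <- rep(1 / K, K)                      # mixing proportions
  # theta[[l]] is a K x M matrix of level probabilities for variable l
  theta <- lapply(seq_len(P), function(l) {
    m <- matrix(runif(K * M), K, M)
    m / rowSums(m)
  })
  for (it in seq_len(n_iter)) {
    # E-step: log joint density of each observation with each cluster
    log_joint <- matrix(log(prop), n, K, byrow = TRUE)
    for (l in seq_len(P)) {
      log_joint <- log_joint + t(log(theta[[l]][, x[, l], drop = FALSE]))
    }
    row_max <- apply(log_joint, 1, max)      # for numerical stability
    w <- exp(log_joint - row_max)
    loglik <- sum(row_max + log(rowSums(w))) # log-likelihood at this E-step
    tau <- w / rowSums(w)                    # posterior membership probabilities
    # M-step: update proportions and level probabilities
    prop <- colMeans(tau)
    for (l in seq_len(P)) {
      for (m in seq_len(M)) {
        theta[[l]][, m] <- colSums(tau[x[, l] == m, , drop = FALSE])
      }
      # small constant avoids log(0) for levels unseen in a cluster
      theta[[l]] <- (theta[[l]] + 1e-8) / rowSums(theta[[l]] + 1e-8)
    }
  }
  # MAP classification: most probable cluster for each observation
  list(prop = prop, theta = theta, loglik = loglik,
       map = max.col(tau, ties.method = "first"))
}

# Simulated data: two clusters, four variables with three levels each.
set.seed(42)
z <- rep(1:2, each = 100)
probs <- rbind(c(0.8, 0.1, 0.1), c(0.1, 0.1, 0.8))
x <- sapply(1:4, function(l) {
  sapply(z, function(k) sample(3, 1, prob = probs[k, ]))
})

fit <- em_categorical(x, K = 2, M = 3)
table(truth = z, map = fit$map)
```

Cluster labels are arbitrary, so the cross-table should be read up to a permutation of the labels. The reported log-likelihood is the value computed in the last E-step, before the final parameter update.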

Author(s)

Wilson Toussile

Maintainer: Wilson Toussile <wilson.toussile@gmail.com>

See Also

The main functions:

em.cluster.R

Compute an approximation of the maximum likelihood estimates of the parameters using the Expectation-Maximization (EM) algorithm, for a given value of (K,S). The maximum a posteriori classification is then derived.

backward.explorer

Gather the most competitive models using a backward-stepwise strategy.

dimJump.R

Perform the data-driven calibration of the penalty function via an estimation of λ. Two candidate values are proposed, together with a plot to help the user choose between them.

selectK.R

Perform the selection of the number K of clusters for a given subset of clustering variables.

model.selection.R

Perform a model selection from a collection of competing models.
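The idea of comparing competing models with a penalized criterion can be sketched in base R as follows. This does not call the package: the table of models is made up, the column names are hypothetical, and AIC and BIC are written in their standard textbook form (-2 log-likelihood plus 2 or log(n) times the dimension), which may differ from the package's own scaling.

```r
# Made-up collection of candidate models (values are illustrative only).
models <- data.frame(
  K      = c(1, 2, 3, 4),
  S      = c("", "1,2", "1,2,3", "1,2,3"),
  loglik = c(-1500, -1380, -1350, -1345),  # maximum log-likelihood
  dim    = c(10, 18, 30, 42),              # number of free parameters
  stringsAsFactors = FALSE
)
n <- 500  # assumed sample size

# Textbook penalized criterion: smaller is better.
penalized <- function(loglik, dim, pen_per_param) {
  -2 * loglik + pen_per_param * dim
}

models$AIC <- penalized(models$loglik, models$dim, 2)
models$BIC <- penalized(models$loglik, models$dim, log(n))

# Each criterion selects the model that minimizes it.
best_aic <- models[which.min(models$AIC), c("K", "S")]
best_bic <- models[which.min(models$BIC), c("K", "S")]
best_aic
best_bic
```

With these made-up numbers the two criteria disagree: the heavier BIC penalty favours a smaller model than AIC does, which is the usual motivation for calibrating the penalty from the data.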

Examples

data(genotype2)
head(genotype2)
data(genotype2_ExploredModels)
head(genotype2_ExploredModels)

# Calibrate the penalty function
outDimJump = dimJump.R(genotype2_ExploredModels, N = 1000, h = 5, header = TRUE)
# Take the first element of the first component as the penalty constant
cte1 = outDimJump[[1]][1]
# Select a model using the calibrated penalty
outSelection = model.selection.R(genotype2_ExploredModels, cte = cte1, header = TRUE)
outSelection
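The calibration step in the example can be pictured with a generic dimension-jump sketch in base R. This is an illustration of the general idea from the slope-heuristics literature, not the code of dimJump.R: it assumes a penalty of the form λ × dimension added to the negative maximum log-likelihood, and the model table, grid and variable names are made up.

```r
# Made-up models: negative maximum log-likelihood and dimension.
# For dimensions >= 30 the fit improves by only 0.43 per extra parameter.
models <- data.frame(
  negloglik = c(1500, 1380.7, 1350, 1345.7, 1341.4, 1337.1, 1332.8, 1328.5),
  dim       = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Dimension of the model minimizing negloglik + lambda * dim.
selected_dim <- function(lambda) {
  crit <- models$negloglik + lambda * models$dim
  models$dim[which.min(crit)]
}

# Track the selected dimension over a grid of lambda values.
lambdas <- seq(0, 20, by = 0.05)
dims <- vapply(lambdas, selected_dim, numeric(1))

# The largest drop in selected dimension locates the candidate lambda.
jumps <- dims[-length(dims)] - dims[-1]
lambda_hat <- lambdas[which.max(jumps) + 1]

# Common slope-heuristics rule (an assumption here): use twice that value.
final_lambda <- 2 * lambda_hat
selected_dim(final_lambda)
```

Plotting dims against lambdas shows the staircase on which the jump is read, which is why a graphical check is useful before settling on a value.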

[Package ClustMMDD version 1.0.4 Index]