clustMD {clustMD}  R Documentation 
Model Based Clustering for Mixed Data
Description
A function that fits the clustMD model to a data set consisting of any combination of continuous, binary, ordinal and nominal variables.
Usage
clustMD(X, G, CnsIndx, OrdIndx, Nnorms, MaxIter, model, store.params = FALSE,
scale = FALSE, startCL = "hc_mclust", autoStop = FALSE, ma.band = 50,
stop.tol = NA)
Arguments
X 
a data matrix where the variables are ordered so that the continuous variables come first, the binary (coded 1 and 2) and ordinal variables (coded 1, 2, ...) come second and the nominal variables (coded 1, 2, ...) are in last position. 
G 
the number of mixture components to be fitted. 
CnsIndx 
the number of continuous variables in the data set. 
OrdIndx 
the sum of the number of continuous, binary and ordinal variables in the data set. 
Nnorms 
the number of Monte Carlo samples to be used for the intractable E-step in the presence of nominal data. Irrelevant if there are no nominal variables. 
MaxIter 
the maximum number of iterations for which the (MC)EM algorithm should run. 
model 
a string indicating which clustMD model is to be fitted. This may be one of: "EII", "VII", "EEI", "VEI", "EVI", "VVI" or "BD". 
store.params 
a logical argument indicating if the parameter estimates at each iteration should be saved and returned by the clustMD function. 
scale 
a logical argument indicating if the continuous variables should be standardised. 
startCL 
a string indicating which clustering method should be used to initialise the (MC)EM algorithm. This may be one of "kmeans" (k-means clustering), "hclust" (hierarchical clustering), "mclust" (finite mixture of Gaussian distributions), "hc_mclust" (model-based hierarchical clustering) or "random" (random cluster allocation). 
autoStop 
a logical argument indicating whether the (MC)EM algorithm should use a stopping criterion to decide if convergence has been reached. Otherwise the algorithm will run for MaxIter iterations. If only continuous variables are present the algorithm will use Aitken's acceleration criterion with tolerance stop.tol. If categorical variables are present, the stopping criterion is based on a moving average of the approximated log likelihood values, taken over ma.band iterations and compared against the tolerance stop.tol. 
ma.band 
the number of iterations to be included in the moving average calculation for the stopping criterion. 
stop.tol 
the tolerance of the (MC)EM stopping criterion. 
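The moving-average stopping rule controlled by ma.band and stop.tol can be sketched in base R as follows. This is an illustrative sketch only, not the package's internal code: the function name ma_converged, the lag argument, and the use of an absolute difference between two window means are all assumptions made for the example.

```r
# Hypothetical helper (not part of clustMD): compares the mean of the most
# recent `ma.band` log-likelihood values with the mean of an earlier window
# of the same length, shifted back by `lag` iterations. The lag and the
# absolute-difference test are assumptions made for this sketch.
ma_converged <- function(loglik, ma.band = 50, stop.tol = 1e-4, lag = 10) {
  n <- length(loglik)
  # Not enough iterations yet to fill both windows
  if (n < ma.band + lag) return(FALSE)
  recent  <- mean(loglik[(n - ma.band + 1):n])
  earlier <- mean(loglik[(n - lag - ma.band + 1):(n - lag)])
  abs(recent - earlier) < stop.tol
}

# Toy log-likelihood trace that rises quickly and then flattens out
ll <- -100 * exp(-(1:200) / 10) - 500

early_check <- ma_converged(ll[1:60], ma.band = 30)  # trace still rising
late_check  <- ma_converged(ll, ma.band = 30)        # trace has flattened
```

A smaller ma.band reacts faster but is more sensitive to Monte Carlo noise in the approximated log likelihood; a larger one smooths that noise at the cost of extra iterations.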
Value
An object of class clustMD is returned. The output components are as follows:
model 
The covariance model fitted to the data. 
G 
The number of clusters fitted to the data. 
Y 
The observed data matrix. 
cl 
The cluster to which each observation belongs. 
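tau 
A matrix of the probabilities of each observation belonging to each cluster. 
means 
A matrix of the cluster means in the combined observed and latent continuous space. 
A 
A matrix containing the diagonal entries of the A matrix corresponding to each cluster. 
Lambda 
A matrix of volume parameters corresponding to each observed or latent dimension for each cluster. 
Sigma 
An array of the covariance matrices for each cluster. 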
BIChat 
The estimated Bayesian information criterion for the model fitted. 
ICLhat 
The estimated integrated classification likelihood criterion for the model fitted. 
paramlist 
If store.params is TRUE, a list of the parameter estimates stored at each iteration. 
Varnames 
A character vector of names corresponding to the columns of Y. 
Varnames_sht 
A truncated version of Varnames. 
likelihood.store 
A vector containing the estimated log likelihood at each iteration. 
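As a sketch of how membership probabilities such as tau relate to hard cluster labels such as cl, the snippet below assigns each observation to its most probable cluster. The matrix is made-up toy data with observations in rows and clusters in columns (an assumed layout), not output from a fitted model, and the assumption that cl is the most probable cluster under tau is ours.

```r
# Toy membership probabilities: 4 observations (rows), 3 clusters (columns).
# These values are invented for illustration.
tau <- rbind(c(0.80, 0.15, 0.05),
             c(0.10, 0.70, 0.20),
             c(0.25, 0.25, 0.50),
             c(0.60, 0.30, 0.10))

# Sanity check: each row of probabilities should sum to 1
stopifnot(all(abs(rowSums(tau) - 1) < 1e-8))

# Hard assignment: index of the largest probability in each row
cl <- apply(tau, 1, which.max)

# Classification uncertainty: 1 minus the largest probability in each row
uncertainty <- 1 - apply(tau, 1, max)

# Cluster sizes under the hard assignment
sizes <- table(cl)
```

Observations with high uncertainty sit between clusters and are worth inspecting before interpreting the partition.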
References
McParland, D. and Gormley, I.C. (2016). Model based clustering for mixed data: clustMD. Advances in Data Analysis and Classification, 10(2):155-169.
Examples
data(Byar)
# Transform skewed variables
Byar$Size.of.primary.tumour <- sqrt(Byar$Size.of.primary.tumour)
Byar$Serum.prostatic.acid.phosphatase <- log(Byar$Serum.prostatic.acid.phosphatase)
# Order variables (continuous, ordinal, nominal)
Y <- as.matrix(Byar[, c(1, 2, 5, 6, 8, 9, 10, 11, 3, 4, 12, 7)])
# Start categorical variables at 1 rather than 0
Y[, 9:12] <- Y[, 9:12] + 1
# Standardise continuous variables
Y[, 1:8] <- scale(Y[, 1:8])
# Merge categories of EKG variable for efficiency
Yekg <- rep(NA, nrow(Y))
Yekg[Y[, 12] == 1] <- 1
Yekg[(Y[, 12] == 2) | (Y[, 12] == 3) | (Y[, 12] == 4)] <- 2
Yekg[(Y[, 12] == 5) | (Y[, 12] == 6) | (Y[, 12] == 7)] <- 3
Y[, 12] <- Yekg
## Not run:
res <- clustMD(X = Y, G = 3, CnsIndx = 8, OrdIndx = 11, Nnorms = 20000,
               MaxIter = 500, model = "EVI", store.params = FALSE, scale = TRUE,
               startCL = "kmeans", autoStop = TRUE, ma.band = 30, stop.tol = 0.0001)
## End(Not run)
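The manual column reordering in the example above can be generalised. The sketch below uses a small made-up data frame and hypothetical variable-type labels (none of which come from the package) to order the columns as continuous, then binary and ordinal, then nominal, and to derive the corresponding CnsIndx and OrdIndx values.

```r
# Toy data: column names, values and type labels are illustrative only
dat <- data.frame(
  stage  = c(1, 2, 3, 2),              # ordinal, coded 1, 2, ...
  weight = c(70.2, 81.5, 65.0, 90.3),  # continuous
  colour = c(1, 3, 2, 3),              # nominal, coded 1, 2, ...
  age    = c(34, 51, 47, 29),          # continuous
  smoker = c(1, 2, 2, 1)               # binary, coded 1 and 2
)
type <- c(stage = "ordinal", weight = "continuous", colour = "nominal",
          age = "continuous", smoker = "binary")

# Group column names by type, preserving their original relative order
cont <- names(type)[type == "continuous"]
ord  <- names(type)[type %in% c("binary", "ordinal")]
nom  <- names(type)[type == "nominal"]

# Continuous first, binary/ordinal second, nominal last
X <- as.matrix(dat[, c(cont, ord, nom)])

CnsIndx <- length(cont)                # number of continuous variables
OrdIndx <- length(cont) + length(ord)  # continuous + binary + ordinal
```

Deriving both indices from the same type vector that drives the reordering keeps them consistent with the column layout if variables are added or dropped later.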