miclust {miclust}    R Documentation

Cluster analysis in multiple imputed data sets with optional variable selection.

Description

Performs cluster analysis in multiple imputed data sets with optional variable selection. Results can be summarized and visualized with the summary and plot methods.

Usage

miclust(
  data,
  method = "kmeans",
  search = c("none", "backward", "forward"),
  ks = 2:3,
  maxvars = NULL,
  usedimp = NULL,
  distance = c("manhattan", "euclidean"),
  centpos = c("means", "medians"),
  initcl = c("hc", "rand"),
  verbose = TRUE,
  seed = NULL
)

Arguments

data

object of class midata obtained with the function getdata.

method

clustering method. Currently, only "kmeans" is accepted.

search

search algorithm for the variable selection procedure: "backward", "forward" or "none". If "none" (default), no variable selection is performed.

ks

the numbers of clusters to explore. Default is 2:3, i.e., solutions with 2 and 3 clusters are explored.

maxvars

if search = "forward", the maximum number of variables to be selected.

usedimp

numeric. Indices of the imputed data sets to be included in the cluster analysis. If NULL (default), all available imputed data sets are included. Otherwise, a single number or a numeric vector indicating which imputed data sets are included.

distance

metric used to compute distances. Two metrics are allowed: "manhattan" (default) and "euclidean".

centpos

how the position of the cluster centroid is computed. If "means" (default), the centroid is located at the mean of each variable; if "medians", at the median.

initcl

starting values for the clustering algorithm. If "rand", they are randomly selected; if "hc", they are computed via hierarchical clustering. See Details below.

verbose

a logical value indicating whether status messages are printed. Default is TRUE.

seed

a number. Seed for reproducibility of results. Default is NULL (no seed).
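
As a rough illustration of what the distance and centpos options mean, the following base R sketch computes both centroid positions and both metrics for a tiny made-up cluster. The helper dist_to and the toy matrix x are hypothetical names introduced here for illustration; they are not part of the miclust package and do not reproduce its internal code.

```r
# Toy cluster: three observations, two variables (made-up values).
x <- rbind(c(1, 2), c(2, 4), c(9, 3))

# Centroid position under each centpos option.
cent_means   <- colMeans(x)           # analogous to centpos = "means"
cent_medians <- apply(x, 2, median)   # analogous to centpos = "medians"

# Hypothetical helper: distance from each row of x to a centroid.
dist_to <- function(x, centre, metric = c("manhattan", "euclidean")) {
  metric <- match.arg(metric)
  d <- sweep(x, 2, centre)            # subtract the centroid column-wise
  if (metric == "manhattan") rowSums(abs(d)) else sqrt(rowSums(d^2))
}

dist_to(x, cent_medians, "manhattan")
dist_to(x, cent_means, "euclidean")
```

The third observation pulls the mean of the first variable away from the other two points, while the median stays with them, which is the usual motivation for a median-based centroid.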

Details

The optimal number of clusters and the final set of variables are selected according to CritCF. CritCF is defined as

CritCF = \left(\frac{2m}{2m + 1} \cdot \frac{1}{1 + W / B}\right)^{\frac{1 + \log_2(k + 1)}{1 + \log_2(m + 1)}},

where m is the number of variables, k is the number of clusters, and W and B are the within- and between-cluster inertias. Higher values of CritCF are preferred (Breaban, 2011). See References below for further details about the clustering algorithm.
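
The formula can be transcribed directly into R. The function below is a minimal sketch written for this page: the name critcf_value is hypothetical, the inertias in the first two calls are made-up numbers, and the last call takes W and B from stats::kmeans (sums of squared Euclidean distances), which may differ from how miclust computes inertias internally.

```r
# Hypothetical helper: CritCF as defined above.
# W, B: within- and between-cluster inertias; m: variables; k: clusters.
critcf_value <- function(W, B, m, k) {
  stopifnot(W >= 0, B > 0, m >= 1, k >= 1)
  penalty  <- 2 * m / (2 * m + 1)
  ratio    <- 1 / (1 + W / B)
  exponent <- (1 + log2(k + 1)) / (1 + log2(m + 1))
  (penalty * ratio)^exponent
}

# Made-up inertias, for illustration only.
critcf_value(W = 40, B = 60, m = 3, k = 3)
critcf_value(W = 40, B = 60, m = 3, k = 2)

# Inertias taken from a standard k-means fit on built-in data (assumption:
# squared Euclidean inertias stand in for W and B).
set.seed(4321)
fit <- kmeans(scale(iris[, 1:4]), centers = 3)
critcf_value(fit$tot.withinss, fit$betweenss, m = 4, k = 3)
```

Because the base of the power lies between 0 and 1, a larger k increases the exponent and lowers CritCF when W / B is held fixed, so additional clusters must reduce W / B enough to compensate.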

For computational reasons, initcl = "rand" is suggested instead of initcl = "hc" for high dimensional data.
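
To make the two initialization strategies concrete, here is a minimal base R sketch of how starting centroids could be obtained from a hierarchical clustering or at random. It is an assumption about the general technique, not the package's actual implementation; the objects init_hc and init_rand are hypothetical names, and the built-in iris data are used only as a stand-in.

```r
# Illustrative data: four numeric columns of the built-in iris data set.
x <- scale(iris[, 1:4])
k <- 3

# Hierarchical start: cluster, cut the tree into k groups, and use the
# group means as starting centroids (a k x m matrix).
hc      <- hclust(dist(x))
groups  <- cutree(hc, k = k)
init_hc <- apply(x, 2, function(col) tapply(col, groups, mean))

# Random start: k observations drawn as starting centroids.
set.seed(4321)
init_rand <- x[sample(nrow(x), k), , drop = FALSE]

# Either matrix can seed a standard k-means run.
fit <- kmeans(x, centers = init_hc)
table(fit$cluster)
```

The hierarchical start needs the full pairwise distance matrix from dist(), which is the costly step in this sketch, whereas the random start only draws k rows.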

Value

A list with class "miclust" including the following items:

clustering

a list of lists containing the results of the clustering algorithm for each analyzed data set and for each analyzed number of clusters. Includes information about selected variables and the cluster vector.

completecasesperc

if data contains a single data frame, percentage of complete cases in data.

data

input data.

ks

the values of the explored number of clusters.

usedimp

indicator of the imputed data sets used.

kfin

optimal number of clusters.

critcf

if data contains a single data frame, critcf contains the optimal (maximum) value of CritCF (see Details) and the number of selected variables in the reduction procedure for each explored number of clusters. If data is a list, critcf contains the optimal value of CritCF for each imputed data set and for each explored value of the number of clusters.

numberofselectedvars

number of selected variables.

selectedkdistribution

if data is a list, frequency of selection of each analyzed number of clusters.

method

input method.

search

input search.

maxvars

input maxvars.

distance

input distance.

centpos

input centpos.

selmetriccent

an object of class kccaFamily needed by the specific summary method.

initcl

input initcl.

References

See Also

getdata for data preparation before using miclust.

Examples

### data preparation:
minhanes1 <- getdata(data = minhanes)

##################
###
### Example 1:
###
### Multiple imputation clustering process with backward variable selection
###
##################

### using only the imputations 1 to 10 for the clustering process and exploring
### 2 vs. 3 clusters:
minhanes1clust <- miclust(data = minhanes1, search = "backward", ks = 2:3,
                          usedimp = 1:10, seed = 4321)
minhanes1clust
minhanes1clust$kfin  ### optimal number of clusters

### graphical summary:
plot(minhanes1clust)

### selection frequency of the variables for the optimal number of clusters:
y <- getvariablesfrequency(minhanes1clust)
y
plot(y$percfreq, type = "h", main = "", xlab = "Variable",
     ylab = "Percentage of times selected", xlim = 0.5 + c(0, length(y$varnames)),
     lwd = 15, col = "blue", xaxt = "n")
axis(1, at = seq_along(y$varnames), labels = y$varnames)

### default summary for the optimal number of clusters:
summary(minhanes1clust)

### summary forcing 3 clusters:
summary(minhanes1clust, k = 3)

##################
###
### Example 2:
###
### Same analysis but without variable selection
###
##################

minhanes2clust <- miclust(data = minhanes1, ks = 2:3, usedimp = 1:10, seed = 4321)
minhanes2clust
plot(minhanes2clust)
summary(minhanes2clust)


##################
###
### Example 3:
###
### Complete case clustering process with backward variable selection
###
##################

nhanes0 <- getdata(data = minhanes[[1]])
nhanes2clust <- miclust(data = nhanes0, search = "backward", ks = 2:3, seed = 4321)
nhanes2clust

summary(nhanes2clust)

### nothing to plot for a single data set analysis
# plot(nhanes2clust)

##################
###
### Example 4:
###
### Complete case clustering process without variable selection
###
##################

nhanes3clust <- miclust(data = nhanes0, ks = 2:3, seed = 4321)
nhanes3clust
summary(nhanes3clust)


[Package miclust version 1.2.8 Index]