clusterMI {clusterMI}    R Documentation

Cluster analysis and pooling after multiple imputation

Description

From a list of imputed data sets, clusterMI performs cluster analysis on each imputed data set, estimates the instability of each partition by bootstrap (following Fang, Y. and Wang, J., 2012 <doi:10.1016/j.csda.2011.09.003>) and pools the results as proposed in Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1>.

Usage

clusterMI(
  output,
  method.clustering = "kmeans",
  method.consensus = "NMF",
  scaling = TRUE,
  nb.clust = NULL,
  Cboot = 50,
  method.hclust = "average",
  method.dist = "euclidean",
  modelNames = NULL,
  modelName.hc = "VVV",
  nstart.kmeans = 100,
  iter.max.kmeans = 10,
  m.cmeans = 2,
  samples.clara = 500,
  nnodes = 1,
  instability = TRUE,
  verbose = TRUE,
  nmf.threshold = 10^(-5),
  nmf.nstart = 100,
  nmf.early_stop_iter = 10,
  nmf.initializer = "random",
  nmf.batch_size = NULL,
  nmf.iter.max = 50
)

Arguments

output

an output from the imputedata function

method.clustering

a single string specifying the clustering algorithm used ("kmeans", "pam", "clara", "hclust", "mixture" or "cmeans")

method.consensus

a single string specifying the consensus method used to pool the contributory partitions ("NMF" or "CSPA")

scaling

boolean. If TRUE, variables are scaled. Default value is TRUE

nb.clust

an integer specifying the number of clusters

Cboot

an integer specifying the number of bootstrap replications. Default value is 50

method.hclust

character string defining the clustering method for hierarchical clustering (required only if method.clustering = "hclust")

method.dist

character string defining the method used for computing dissimilarity matrices in hierarchical clustering (required only if method.clustering = "hclust")

modelNames

character string indicating the models to be fitted in the EM phase of clustering (required only if method.clustering = "mixture"). By default modelNames = NULL.

modelName.hc

character string indicating the model to be used in model-based agglomerative hierarchical clustering (required only if method.clustering = "mixture"). By default modelName.hc = "VVV".

nstart.kmeans

how many random sets should be chosen for kmeans initialization. Default value is 100 (required only if method.clustering = "kmeans")

iter.max.kmeans

maximum number of iterations allowed for kmeans. Default value is 10 (required only if method.clustering = "kmeans")

m.cmeans

degree of fuzzification in cmeans clustering. By default m.cmeans = 2

samples.clara

number of samples to be drawn from the data set when performing clustering using the clara algorithm. Default value is 500.

nnodes

number of CPU cores for parallel computing. By default, nnodes = 1

instability

a boolean indicating if cluster instability must be computed. Default value is TRUE

verbose

a boolean. If TRUE, a message is printed at each step. Default value is TRUE

nmf.threshold

Default value is 10^(-5)

nmf.nstart

Default value is 100

nmf.early_stop_iter

Default value is 10

nmf.initializer

Default value is "random"

nmf.batch_size

Default value is NULL

nmf.iter.max

Default value is 50

Details

clusterMI performs cluster analysis (according to the method.clustering argument) and pooling after multiple imputation. To do so, the clusterMI function takes as input an output from the imputedata function and then

  1. applies the cluster analysis method on each imputed data set

  2. pools contributory partitions using non-negative matrix factorization

  3. computes the instability of each partition by bootstrap

  4. computes the total instability

Step 1 can be tuned by specifying the cluster analysis method used (method.clustering argument). If method.clustering = "kmeans" or "pam", then the number of clusters can be specified through the nb.clust argument. By default, the same number as the one used for imputation is used. The number of random initializations can also be tuned through the nstart.kmeans argument. If method.clustering = "hclust" (hierarchical clustering), the agglomeration method can be specified through the method.hclust argument (see hclust). By default "average" is used. Furthermore, the number of clusters can be specified, but it can also be automatically chosen if nb.clust < 0. If method.clustering = "mixture" (model-based clustering using Gaussian mixture models), the model to be fitted can be tuned by modifying the modelNames argument (see Mclust). If method.clustering = "cmeans" (clustering using the fuzzy c-means algorithm), then the fuzziness parameter can be modified through the m.cmeans argument. By default, m.cmeans = 2.
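As an illustration of these tuning options, the sketch below runs steps 1 and 2 with two different clustering methods on the same imputed data sets. It reuses the wine data and the helper calls from the Examples section; the choice of "complete" linkage is an assumption made for illustration (any agglomeration method accepted by hclust is expected to work), and instability = FALSE is used only to keep the run short.

```r
library(clusterMI)

data(wine)
set.seed(123456)
wine.na <- wine
wine.na$cult <- NULL          # drop the reference partition
wine.na <- prodna(wine.na)    # add missing values

# multiple imputation (m = 5 is small, chosen only for speed)
res.imp <- imputedata(data.na = wine.na, nb.clust = 3, m = 5)

# hierarchical clustering, complete linkage on Euclidean distances
res.hc <- clusterMI(res.imp,
                    method.clustering = "hclust",
                    method.hclust = "complete",
                    method.dist = "euclidean",
                    nb.clust = 3,
                    instability = FALSE,
                    verbose = FALSE)

# partitioning around medoids with the same number of clusters
res.pam <- clusterMI(res.imp,
                     method.clustering = "pam",
                     nb.clust = 3,
                     instability = FALSE,
                     verbose = FALSE)

# cross-tabulate the two consensus partitions
table(res.hc$part, res.pam$part)
```

Cluster labels are arbitrary, so the cross-tabulation should be read up to a permutation of rows or columns.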

Step 2 performs consensus clustering by Non-Negative Matrix Factorization, following Li, Ding and Jordan (2007) <doi:10.1109/ICDM.2007.98>.
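To give some intuition for consensus clustering, the base R sketch below builds the averaged connectivity matrix of a few partitions, which is the kind of object consensus methods start from: entry (i, j) is the proportion of partitions in which observations i and j fall in the same cluster. The partitions are hypothetical, and the code is not the internal implementation of clusterMI; the factorization step itself is not shown.

```r
# Three hypothetical partitions of the same 5 observations
# (for instance, one per imputed data set)
parts <- list(c(1, 1, 2, 2, 3),
              c(2, 2, 1, 1, 1),
              c(1, 1, 1, 2, 2))

# connectivity matrix: 1 if two observations share a cluster, 0 otherwise
connectivity <- function(p) outer(p, p, "==") * 1

# average of the connectivity matrices over the partitions
M_bar <- Reduce(`+`, lapply(parts, connectivity)) / length(parts)
round(M_bar, 2)
```

Because the connectivity matrix only records whether two observations are clustered together, it does not depend on how clusters are labelled in each partition.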

Step 3 applies the nselectboot function on each imputed data set and returns the instability of each partition obtained at step 1. The method is based on bootstrap sampling, following Fang, Y. and Wang, J. (2012) <doi:10.1016/j.csda.2011.09.003>. The number of bootstrap replications can be tuned using the Cboot argument.
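The idea behind a bootstrap instability measure can be sketched in base R as follows. This is a simplified illustration in the spirit of Fang and Wang (2012), not the nselectboot code: two bootstrap samples are clustered by kmeans, each fit is used to assign all original observations to the nearest centre, and the two resulting partitions are compared through the proportion of pairs of observations on which they disagree. The helper names (assign_nearest, pair_disagreement, boot_instability), the use of the iris data and the small number of replications are assumptions made for the example.

```r
set.seed(1)
X <- scale(iris[, 1:4])

# assign each row of X to the nearest centre (squared Euclidean distance)
assign_nearest <- function(X, centers) {
  d <- sapply(seq_len(nrow(centers)),
              function(k) rowSums(sweep(X, 2, centers[k, ])^2))
  max.col(-d, ties.method = "first")
}

# proportion of pairs of observations on which two partitions disagree
pair_disagreement <- function(p1, p2) {
  same1 <- outer(p1, p1, "==")
  same2 <- outer(p2, p2, "==")
  mean((same1 != same2)[upper.tri(same1)])
}

# average disagreement over Cboot pairs of bootstrap samples
boot_instability <- function(X, k, Cboot = 10) {
  n <- nrow(X)
  mean(replicate(Cboot, {
    fit1 <- kmeans(X[sample(n, replace = TRUE), ], centers = k, nstart = 5)
    fit2 <- kmeans(X[sample(n, replace = TRUE), ], centers = k, nstart = 5)
    pair_disagreement(assign_nearest(X, fit1$centers),
                      assign_nearest(X, fit2$centers))
  }))
}

inst <- boot_instability(X, k = 3)
inst
```

A value close to 0 indicates that clusterings built on different bootstrap samples lead to nearly the same partition of the original observations.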

Step 4 averages the previous instability measures, giving a within instability (Ubar), computes a between instability (B) and a total instability (T = B + Ubar). See Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1> for details.
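The arithmetic of this pooling step can be checked by hand. In the sketch below, the values of U and B are hypothetical numbers chosen for illustration (the computation of B itself is described in Audigier and Niang, 2022, and is not reproduced here); only the averaging and the sum follow the description above.

```r
# Hypothetical instability values for m = 5 imputed data sets
U <- c(0.08, 0.10, 0.07, 0.09, 0.11)  # within instability per imputed data set
B <- 0.03                             # between instability, assumed given

Ubar <- mean(U)    # average within instability
Tot <- B + Ubar    # total instability

# same layout as the instability component described in the Value section
instability <- list(U = U, Ubar = Ubar, B = B, Tot = Tot)
instability
```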

All steps can be performed in parallel by specifying the number of CPU cores (nnodes argument). Steps 3 and 4 are more time consuming. To compute only steps 1 and 2 use instability = FALSE.

Value

A list with three objects

part

the consensus partition

instability

a list of four objects: U the within instability measure for each imputed data set, Ubar the associated average, B the between instability measure, Tot the total instability measure

call

the matching call

References

Audigier, V. and Niang, N. (2022) Clustering with missing data: which equivalent for Rubin's rules? Advances in Data Analysis and Classification <doi:10.1007/s11634-022-00519-1>

Fang, Y. and Wang, J. (2012) Selection of the number of clusters via the bootstrap method. Computational Statistics and Data Analysis, 56, 468-477. <doi:10.1016/j.csda.2011.09.003>

Li, T., Ding, C. and Jordan, M. I. (2007) Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM'07, pages 577-582, USA. IEEE Computer Society. <doi:10.1109/ICDM.2007.98>

See Also

hclust, nselectboot, Mclust, imputedata, cmeans

Examples

library(clusterMI)
data(wine)

require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
m <- 5 # number of imputed data sets. Should be larger in practice
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)

#imputation
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)

#analysis by kmeans and pooling
nnodes <- 2 # parallel::detectCores()
res.pool <- clusterMI(res.imp, nnodes = nnodes)

res.pool$instability
table(ref, res.pool$part)


[Package clusterMI version 1.2.1 Index]