R: Cluster analysis and pooling after multiple imputation

clusterMI {clusterMI}

R Documentation

Cluster analysis and pooling after multiple imputation

Description

From a list of imputed data sets, clusterMI performs cluster analysis on each imputed data set, estimates the instability of each partition by bootstrap (following Fang, Y. and Wang, J., 2012 <doi:10.1016/j.csda.2011.09.003>) and pools the results as proposed in Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1>.

Usage

clusterMI(
  output,
  method.clustering = "kmeans",
  method.consensus = "NMF",
  scaling = TRUE,
  nb.clust = NULL,
  Cboot = 50,
  method.hclust = "average",
  method.dist = "euclidean",
  modelNames = NULL,
  modelName.hc = "VVV",
  nstart.kmeans = 100,
  iter.max.kmeans = 10,
  m.cmeans = 2,
  samples.clara = 500,
  nnodes = 1,
  instability = TRUE,
  verbose = TRUE,
  nmf.threshold = 10^(-5),
  nmf.nstart = 100,
  nmf.early_stop_iter = 10,
  nmf.initializer = "random",
  nmf.batch_size = NULL,
  nmf.iter.max = 50
)

Arguments

`output`	an output from the imputedata function
`method.clustering`	a single string specifying the clustering algorithm used ("kmeans", "pam", "clara", "hclust", "mixture" or "cmeans")
`method.consensus`	a single string specifying the consensus method used to pool the contributory partitions ("NMF" or "CSPA")
`scaling`	boolean. If TRUE, variables are scaled. Default value is TRUE
`nb.clust`	an integer specifying the number of clusters
`Cboot`	an integer specifying the number of bootstrap replications. Default value is 50
`method.hclust`	character string defining the clustering method for hierarchical clustering (required only if method.clustering = "hclust")
`method.dist`	character string defining the method used for computing dissimilarity matrices in hierarchical clustering (required only if method.clustering = "hclust")
`modelNames`	character string indicating the models to be fitted in the EM phase of clustering (required only if method.clustering = "mixture"). By default modelNames = NULL.
`modelName.hc`	a character string indicating the model to be used in model-based agglomerative hierarchical clustering (required only if method.clustering = "mixture"). By default modelName.hc = "VVV".
`nstart.kmeans`	how many random sets should be chosen for kmeans initialization. Default value is 100 (required only if method.clustering = "kmeans")
`iter.max.kmeans`	how many iterations should be chosen for kmeans. Default value is 10 (required only if method.clustering = "kmeans")
`m.cmeans`	degree of fuzzification in cmeans clustering. By default m.cmeans = 2
`samples.clara`	number of samples to be drawn from the dataset when performing clustering using clara algorithm. Default value is 500.
`nnodes`	number of CPU cores for parallel computing. By default, nnodes = 1
`instability`	a boolean indicating if cluster instability must be computed. Default value is TRUE
`verbose`	a boolean. If TRUE, a message is printed at each step. Default value is TRUE
`nmf.threshold`	Default value is 10^(-5)
`nmf.nstart`	Default value is 100
`nmf.early_stop_iter`	Default value is 10
`nmf.initializer`	Default value is 'random'
`nmf.batch_size`	Default value is 20
`nmf.iter.max`	Default value is 50

Details

clusterMI performs cluster analysis (according to the method.clustering argument) and pooling after multiple imputation. To do so, the clusterMI function takes as input an output from the imputedata function and then:

1. applies the cluster analysis method on each imputed data set
2. pools contributory partitions using non-negative matrix factorization
3. computes the instability of each partition by bootstrap
4. computes the total instability

Step 1 can be tuned by specifying the cluster analysis method used (method.clustering argument). If method.clustering = "kmeans" or "pam", the number of clusters can be specified through the nb.clust argument. By default, the number of clusters used for imputation is used. For k-means, the number of random initializations can also be tuned through the nstart.kmeans argument. If method.clustering = "hclust" (hierarchical clustering), the agglomeration method can be specified through the method.hclust argument (see hclust). By default, "average" is used. Furthermore, the number of clusters can be specified, but it can also be chosen automatically if nb.clust < 0. If method.clustering = "mixture" (model-based clustering using Gaussian mixture models), the model to be fitted can be tuned by modifying the modelNames argument (see Mclust). If method.clustering = "cmeans" (clustering using the fuzzy c-means algorithm), the fuzziness parameter can be modified through the m.cmeans argument. By default, m.cmeans = 2.
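As a rough illustration of step 1, the sketch below runs k-means separately on each element of a list of completed data sets, using base R only. It is not the clusterMI implementation: the list imputed is a hypothetical stand-in built by perturbing the iris measurements, whereas real imputed data sets would come from imputedata.

```r
# Hypothetical stand-in for m imputed data sets: the same complete data
# with small random perturbations (real imputations come from imputedata).
set.seed(1)
base <- as.matrix(iris[, 1:4])
m <- 3
imputed <- lapply(seq_len(m), function(i) {
  base + matrix(rnorm(length(base), sd = 0.05), nrow = nrow(base))
})

nb.clust <- 3

# Scale each data set, then run k-means with several random starts.
partitions <- lapply(imputed, function(x) {
  fit <- kmeans(scale(x), centers = nb.clust, nstart = 20, iter.max = 10)
  fit$cluster
})

# One partition (integer vector of length n) per imputed data set.
sapply(partitions, table)
```

Cluster labels are arbitrary within each partition, which is why the partitions cannot simply be averaged and a consensus step is needed.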

Step 2 performs consensus clustering by non-negative matrix factorization, following Li, Ding and Jordan (2007) <doi:10.1109/ICDM.2007.98>.
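A common starting point for consensus clustering, and an assumption about what the factorization in step 2 operates on, is the average connectivity matrix of the contributory partitions. The sketch below builds that matrix for three hypothetical partitions and then, as a plainly simpler stand-in for the NMF step, cuts an average-linkage tree on it. It is meant only to show why pooling works on pairwise co-membership rather than on raw labels.

```r
# Three hypothetical contributory partitions of six individuals.
# Labels are arbitrary: partitions 1 and 2 are identical up to relabelling.
partitions <- list(
  c(1, 1, 1, 2, 2, 2),
  c(2, 2, 2, 1, 1, 1),
  c(1, 1, 2, 2, 2, 2)
)

# Connectivity matrix of one partition: entry (i, j) is 1 if i and j
# share a cluster, 0 otherwise.
connectivity <- function(p) outer(p, p, "==") * 1

# Average connectivity across partitions; invariant to label switching.
M <- Reduce(`+`, lapply(partitions, connectivity)) / length(partitions)

# Stand-in for the factorization step (NOT NMF): average-linkage
# hierarchical clustering on the dissimilarity 1 - M.
consensus <- cutree(hclust(as.dist(1 - M), method = "average"), k = 2)
consensus
```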

Step 3 applies the nselectboot function on each imputed data set and returns the instability of each partition obtained at step 1. The method is based on bootstrap sampling, following Fang, Y. and Wang, J. (2012) <doi:10.1016/j.csda.2011.09.003>. The number of bootstrap replications can be tuned using the Cboot argument.
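The idea behind a bootstrap instability measure can be sketched in base R as follows. This is a simplified illustration for k-means under stated assumptions, not the nselectboot code: the helpers assign_nearest, partition_distance and boot_instability are hypothetical names, and the distance between two partitions is taken to be the proportion of pairs of individuals on which they disagree.

```r
# Assign each row of x to its nearest center (squared Euclidean distance).
assign_nearest <- function(x, centers) {
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(x, 2, centers[k, ])^2)
  })
  max.col(-d, ties.method = "first")
}

# Proportion of pairs of individuals on which two partitions disagree.
partition_distance <- function(p1, p2) {
  same1 <- outer(p1, p1, "==")
  same2 <- outer(p2, p2, "==")
  mean((same1 != same2)[upper.tri(same1)])
}

# Hypothetical helper: bootstrap instability of k-means with k clusters.
boot_instability <- function(x, k, Cboot = 20) {
  n <- nrow(x)
  mean(replicate(Cboot, {
    # Fit k-means on two independent bootstrap samples.
    fit1 <- kmeans(x[sample(n, replace = TRUE), , drop = FALSE], k, nstart = 5)
    fit2 <- kmeans(x[sample(n, replace = TRUE), , drop = FALSE], k, nstart = 5)
    # Extend both fits to the full data set and compare the partitions.
    partition_distance(assign_nearest(x, fit1$centers),
                       assign_nearest(x, fit2$centers))
  }))
}

set.seed(42)
x <- scale(as.matrix(iris[, 1:4]))
boot_instability(x, k = 3)
```

A value close to 0 means that the two bootstrap fits nearly always produce the same partition of the data; larger values indicate a less stable partition.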

Step 4 averages the previous instability measures, giving a within instability (Ubar), then computes a between instability (B) and a total instability (T = B + Ubar). See Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1> for details.
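The arithmetic of step 4 can be checked with a few lines of R. The values of U and B below are hypothetical and chosen only for illustration; how B itself is obtained is described in Audigier and Niang (2022) and is not reproduced here.

```r
# Hypothetical instability values, one per imputed data set (m = 5).
U <- c(0.10, 0.12, 0.08, 0.11, 0.09)
# Hypothetical between instability; see the paper for its computation.
B <- 0.04

Ubar <- mean(U)   # within instability: average of the per-data-set values
Tot <- B + Ubar   # total instability, T = B + Ubar

c(Ubar = Ubar, B = B, Tot = Tot)
```

These names mirror the U, Ubar, B and Tot elements of the instability list returned by clusterMI.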

All steps can be performed in parallel by specifying the number of CPU cores (nnodes argument). Steps 3 and 4 are more time-consuming than steps 1 and 2. To compute only steps 1 and 2, use instability = FALSE.

Value

A list with three objects

`part`	the consensus partition
`instability`	a list of four objects: `U` the within instability measure for each imputed data set, `Ubar` the associated average, `B` the between instability measure, `Tot` the total instability measure
`call`	the matching call

References

Audigier, V. and Niang, N. (2022) Clustering with missing data: which equivalent for Rubin's rules? Advances in Data Analysis and Classification <doi:10.1007/s11634-022-00519-1>

Fang, Y. and Wang, J. (2012) Selection of the number of clusters via the bootstrap method. Computational Statistics and Data Analysis, 56, 468-477. <doi:10.1016/j.csda.2011.09.003>

Li, T., Ding, C. and Jordan, M. I. (2007) Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM'07, pages 577-582, USA. IEEE Computer Society. <doi:10.1109/ICDM.2007.98>

Examples

library(clusterMI) # provides wine, prodna, imputedata and clusterMI
library(parallel)

data(wine)
set.seed(123456)
ref <- wine$cult # reference partition
nb.clust <- 3
m <- 5 # number of imputed data sets; should be larger in practice

# drop the reference variable and add missing values
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)

# imputation
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)

# analysis by kmeans and pooling
nnodes <- 2 # or parallel::detectCores()
res.pool <- clusterMI(res.imp, nnodes = nnodes)

# instability measures and comparison with the reference partition
res.pool$instability
table(ref, res.pool$part)

[Package clusterMI version 1.2.1 Index]