clusterMI {clusterMI} | R Documentation |
Cluster analysis and pooling after multiple imputation
Description
From a list of imputed data sets, clusterMI performs cluster analysis on each imputed data set, estimates the instability of each partition using the bootstrap (following Fang, Y. and Wang, J., 2012 <doi:10.1016/j.csda.2011.09.003>) and pools the results as proposed in Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1>.
Usage
clusterMI(
output,
method.clustering = "kmeans",
method.consensus = "NMF",
scaling = TRUE,
nb.clust = NULL,
Cboot = 50,
method.hclust = "average",
method.dist = "euclidean",
modelNames = NULL,
modelName.hc = "VVV",
nstart.kmeans = 100,
iter.max.kmeans = 10,
m.cmeans = 2,
samples.clara = 500,
nnodes = 1,
instability = TRUE,
verbose = TRUE,
nmf.threshold = 10^(-5),
nmf.nstart = 100,
nmf.early_stop_iter = 10,
nmf.initializer = "random",
nmf.batch_size = NULL,
nmf.iter.max = 50
)
Arguments
output |
an output from the imputedata function |
method.clustering |
a single string specifying the clustering algorithm used ("kmeans", "pam", "clara", "hclust", "mixture" or "cmeans") |
method.consensus |
a single string specifying the consensus method used to pool the contributory partitions ("NMF" or "CSPA") |
scaling |
boolean. If TRUE, variables are scaled. Default value is TRUE |
nb.clust |
an integer specifying the number of clusters |
Cboot |
an integer specifying the number of bootstrap replications. Default value is 50 |
method.hclust |
character string defining the clustering method for hierarchical clustering (required only if method.clustering = "hclust") |
method.dist |
character string defining the method used for computing dissimilarity matrices in hierarchical clustering (required only if method.clustering = "hclust") |
modelNames |
character string indicating the models to be fitted in the EM phase of clustering (required only if method.clustering = "mixture"). By default modelNames = NULL. |
modelName.hc |
a character string indicating the model to be used in model-based agglomerative hierarchical clustering (required only if method.clustering = "mixture"). By default modelName.hc = "VVV". |
nstart.kmeans |
number of random sets chosen for kmeans initialization (required only if method.clustering = "kmeans"). Default value is 100 |
iter.max.kmeans |
maximum number of iterations for kmeans (required only if method.clustering = "kmeans"). Default value is 10 |
m.cmeans |
degree of fuzzification in cmeans clustering. By default m.cmeans = 2 |
samples.clara |
number of samples to be drawn from the dataset when performing clustering using the clara algorithm. Default value is 500. |
nnodes |
number of CPU cores for parallel computing. By default, nnodes = 1 |
instability |
a boolean indicating if cluster instability must be computed. Default value is TRUE |
verbose |
a boolean. If TRUE, a message is printed at each step. Default value is TRUE |
nmf.threshold |
Default value is 10^(-5) |
nmf.nstart |
Default value is 100 |
nmf.early_stop_iter |
Default value is 10 |
nmf.initializer |
Default value is 'random' |
nmf.batch_size |
Default value is NULL |
nmf.iter.max |
Default value is 50 |
Details
clusterMI performs cluster analysis (according to the method.clustering argument) and pooling after multiple imputation. To achieve this, the clusterMI function takes as input an output from the imputedata function and then
1. applies the cluster analysis method on each imputed data set
2. pools contributory partitions using non-negative matrix factorization
3. computes the instability of each partition by bootstrap
4. computes the total instability
Step 1 can be tuned by specifying the cluster analysis method used (method.clustering argument).
If method.clustering = "kmeans" or "pam", then the number of clusters can be specified through the nb.clust argument. By default, the same number as the one used for imputation is used. The number of random initializations can also be tuned through the nstart.kmeans argument.
If method.clustering = "hclust" (hierarchical clustering), the agglomeration method can be specified through the method.hclust argument (see hclust). By default "average" is used. Furthermore, the number of clusters can be specified, but it can also be automatically chosen if nb.clust < 0.
If method.clustering = "mixture" (model-based clustering using Gaussian mixture models), the model to be fitted can be tuned by modifying the modelNames argument (see Mclust).
If method.clustering = "cmeans" (clustering using the fuzzy c-means algorithm), then the fuzziness parameter can be modified through the m.cmeans argument. By default, m.cmeans = 2.
Step 2 performs consensus clustering, by default (method.consensus = "NMF") by Non-Negative Matrix Factorization, following Li, Ding and Jordan (2007) <doi:10.1109/ICDM.2007.98>.
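To make the pooling step more concrete, the following base R sketch builds the averaged connectivity (co-association) matrix of a few partitions, which is the kind of object an NMF-based consensus method such as the one in the cited paper factorizes. The partitions and the names connectivity and consensus.matrix are hypothetical, the factorization itself is not shown, and this is not the code used internally by clusterMI.

```r
# Three hypothetical partitions of six observations (labels are arbitrary).
partitions <- list(
  c(1, 1, 1, 2, 2, 2),
  c(2, 2, 2, 1, 1, 1),
  c(1, 1, 2, 2, 2, 2)
)

# Connectivity matrix of one partition: entry (i, j) is 1 when
# observations i and j share a cluster, 0 otherwise.
connectivity <- function(part) {
  outer(part, part, "==") * 1
}

# Average the connectivity matrices over the contributory partitions.
consensus.matrix <- Reduce(`+`, lapply(partitions, connectivity)) / length(partitions)
round(consensus.matrix, 2)
```

Because the matrix only records whether two observations are grouped together, it does not depend on how the clusters are labelled in each partition, which is why the first two partitions above contribute identically.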
Step 3 applies the nselectboot function on each imputed data set and returns the instability of each partition obtained at step 1. The method is based on bootstrap sampling, following Fang, Y. and Wang, J. (2012) <doi:10.1016/j.csda.2011.09.003>. The number of bootstrap replications can be tuned using the Cboot argument.
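The idea behind a bootstrap instability measure can be sketched in base R as follows. This is a simplified illustration with kmeans on hypothetical toy data, written independently of nselectboot; the helper names assign.nearest, partition.distance and instability.sketch are assumptions made for this example, and the details may differ from what the package computes.

```r
set.seed(1)

# Hypothetical toy data: two well-separated groups in two dimensions.
x <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2)
)

# Assign each row of `data` to its nearest centre (squared Euclidean distance).
assign.nearest <- function(data, centers) {
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(data, 2, centers[k, ])^2)
  })
  max.col(-d, ties.method = "first")
}

# Proportion of pairs of observations on which two partitions disagree
# about whether the pair belongs to the same cluster.
partition.distance <- function(p1, p2) {
  same1 <- outer(p1, p1, "==")
  same2 <- outer(p2, p2, "==")
  mean((same1 != same2)[upper.tri(same1)])
}

instability.sketch <- function(data, k, Cboot = 20) {
  n <- nrow(data)
  d <- replicate(Cboot, {
    # Fit k-means on two independent bootstrap samples.
    fit1 <- kmeans(data[sample(n, replace = TRUE), , drop = FALSE],
                   centers = k, nstart = 10)
    fit2 <- kmeans(data[sample(n, replace = TRUE), , drop = FALSE],
                   centers = k, nstart = 10)
    # Extend both clusterings to the full data set and compare them.
    partition.distance(
      assign.nearest(data, fit1$centers),
      assign.nearest(data, fit2$centers)
    )
  })
  # Average disagreement over the bootstrap replications.
  mean(d)
}

instab <- instability.sketch(x, k = 2)
instab
```

Values close to 0 indicate that clusterings fitted on different bootstrap samples agree on the full data set; larger values indicate a less stable partition.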
Step 4 averages the previous instability measures, giving a within instability (Ubar), computes a between instability (B) and a total instability (T = B + Ubar). See Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1> for details.
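As a small worked illustration of this decomposition, the sketch below uses made-up numbers: U holds hypothetical instability values for m = 5 imputed data sets and B is a hypothetical between instability. How B is actually obtained is described in the cited paper and is not reproduced here.

```r
# Hypothetical instability values for m = 5 imputed data sets (step 3 output).
U <- c(0.08, 0.10, 0.07, 0.09, 0.11)

# Within instability: the average over the imputed data sets.
Ubar <- mean(U)

# Hypothetical between instability (assumed value, for illustration only).
B <- 0.04

# Total instability. The name Tot avoids masking R's built-in T (alias of TRUE).
Tot <- B + Ubar
c(Ubar = Ubar, B = B, Tot = Tot)
```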
All steps can be performed in parallel by specifying the number of CPU cores (nnodes argument). Steps 3 and 4 are the most time-consuming. To compute only steps 1 and 2, use instability = FALSE.
Value
A list with three objects
part |
the consensus partition |
instability |
a list of four objects: |
call |
the matching call |
References
Audigier, V. and Niang, N. (2022) Clustering with missing data: which equivalent for Rubin's rules? Advances in Data Analysis and Classification <doi:10.1007/s11634-022-00519-1>
Fang, Y. and Wang, J. (2012) Selection of the number of clusters via the bootstrap method. Computational Statistics and Data Analysis, 56, 468-477. <doi:10.1016/j.csda.2011.09.003>
Li, T., Ding, C. and Jordan, M. I. (2007) Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM'07, pages 577-582, USA. IEEE Computer Society. <doi:10.1109/ICDM.2007.98>
See Also
hclust, nselectboot, Mclust, imputedata, cmeans
Examples
library(clusterMI)
library(parallel)
data(wine)
set.seed(123456)
ref <- wine$cult # reference partition
nb.clust <- 3
m <- 5 # number of imputed data sets. Should be larger in practice
# remove the reference partition and add missing values
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
# imputation
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)
# analysis by kmeans and pooling
nnodes <- 2 # parallel::detectCores()
res.pool <- clusterMI(res.imp, nnodes = nnodes)
# instability measures and comparison with the reference partition
res.pool$instability
table(ref, res.pool$part)