miclust {miclust} | R Documentation |
Cluster analysis in multiple imputed data sets with optional variable selection.
Description
Performs cluster analysis in multiple imputed data sets with optional variable
selection. Results can be summarized and visualized with the summary and plot methods.
Usage
miclust(
data,
method = "kmeans",
search = c("none", "backward", "forward"),
ks = 2:3,
maxvars = NULL,
usedimp = NULL,
distance = c("manhattan", "euclidean"),
centpos = c("means", "medians"),
initcl = c("hc", "rand"),
verbose = TRUE,
seed = NULL
)
Arguments
data |
object of class |
method |
clustering method. Currently, only |
search |
search algorithm for the variable selection procedure: |
ks |
the values of the explored number of clusters. Default is exploring 2 and 3 clusters. |
maxvars |
if |
usedimp |
numeric. Which imputed data sets must be included in the cluster
analysis. If |
distance |
two metrics are allowed to compute distances: |
centpos |
position computation of the cluster centroid. If |
initcl |
starting values for the clustering algorithm. If |
verbose |
a logical value indicating whether status messages are printed. Default is |
seed |
a number. Seed for reproducibility of results. Default is |
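As a small illustration of what the distance and centpos choices mean, the following base R sketch compares the two metrics on a pair of points and the two centroid positions on a toy cluster. The objects pts and cluster are made up for this example, and the code does not show how miclust computes these quantities internally.

```r
# Two points that differ by 3 in the first variable and 4 in the second.
pts <- rbind(a = c(0, 0), b = c(3, 4))
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))  # |3| + |4|
d_euclidean <- as.numeric(dist(pts, method = "euclidean"))  # sqrt(3^2 + 4^2)

# A toy cluster with one outlying value in the first variable.
cluster <- rbind(c(1, 10), c(2, 20), c(9, 30))
centroid_means   <- colMeans(cluster)           # pulled towards the outlier
centroid_medians <- apply(cluster, 2, median)   # more robust to the outlier
```

In this toy cluster the median-based centroid stays closer to the bulk of the observations in the first variable than the mean-based one does.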
Details
The optimal number of clusters and the final set of variables are selected according to CritCF. CritCF is defined as
CritCF = \left(\frac{2m}{2m + 1} \cdot \frac{1}{1 + W / B}\right)^{\frac{1 + \log_2(k + 1)}{1 + \log_2(m + 1)}},
where m is the number of variables, k is the number of clusters, and W and B are the within- and between-cluster inertias. Higher values of CritCF are preferred (Breaban, 2011). See References below for further details about the clustering algorithm.
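The CritCF formula can be evaluated directly once m, k, W and B are known. The sketch below is only an illustration: critcf is a hypothetical helper, not a function exported by miclust, and W and B are taken from base R's kmeans (squared Euclidean inertias), which may differ from the inertias miclust computes under its own distance and centroid settings.

```r
# Hypothetical helper (not part of miclust): CritCF from its ingredients.
critcf <- function(m, k, W, B) {
  base <- (2 * m / (2 * m + 1)) * (1 / (1 + W / B))
  expo <- (1 + log2(k + 1)) / (1 + log2(m + 1))
  base^expo
}

# Toy use on a built-in data set, with W and B taken from stats::kmeans.
set.seed(1)
x <- scale(iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 10)
val <- critcf(m = ncol(x), k = 3, W = fit$tot.withinss, B = fit$betweenss)
val
```

For fixed m and k, a smaller ratio W / B (compact, well separated clusters) gives a larger CritCF, which is why higher values are preferred.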
For computational reasons, option "rand" is suggested instead of "hc" for high dimensional data.
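To give a feel for why a hierarchical start can be more expensive than a random one, here is a generic base R sketch of the two ideas. It is an assumption-laden illustration: init_centers is a hypothetical helper, and the way miclust actually builds its "hc" and "rand" starting values is not documented here and may differ.

```r
# Hypothetical helper (not part of miclust): two generic ways to choose
# k starting centroids for a partitioning algorithm.
init_centers <- function(x, k, how = c("hc", "rand")) {
  how <- match.arg(how)
  x <- as.matrix(x)
  if (how == "hc") {
    # Hierarchical start: needs all pairwise distances between rows,
    # which becomes costly as the data grow.
    groups <- cutree(hclust(dist(x, method = "manhattan")), k = k)
    centers <- rowsum(x, groups) / as.vector(table(groups))
  } else {
    # Random start: k observations drawn at random, no distance matrix.
    centers <- x[sample(nrow(x), k), , drop = FALSE]
  }
  unname(centers)
}

set.seed(1)
x <- as.matrix(iris[, 1:4])
centers_hc   <- init_centers(x, k = 3, how = "hc")
centers_rand <- init_centers(x, k = 3, how = "rand")
```

Both calls return a k by m matrix of starting centroids; only the hierarchical variant has to compute and store the full distance matrix first.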
Value
A list with class "miclust" including the following items:
- clustering
a list of lists containing the results of the clustering algorithm for each analyzed data set and for each analyzed number of clusters. Includes information about selected variables and the cluster vector.
- completecasesperc
if data contains a single data frame, percentage of complete cases in data.
- data
input data.
- ks
the values of the explored number of clusters.
- usedimp
indicator of the imputed data sets used.
- kfin
optimal number of clusters.
- critcf
if data contains a single data frame, critcf contains the optimal (maximum) value of CritCF (see Details) and the number of selected variables in the reduction procedure for each explored number of clusters. If data is a list, critcf contains the optimal value of CritCF for each imputed data set and for each explored value of the number of clusters.
- numberofselectedvars
number of selected variables.
- selectedkdistribution
if data is a list, frequency of selection of each analyzed number of clusters.
- method
input method.
- search
input search.
- maxvars
input maxvars.
- distance
input distance.
- centpos
input centpos.
- selmetriccent
an object of class kccaFamily needed by the specific summary method.
- initcl
input initcl.
References
Basagana X, Barrera-Gomez J, Benet M, Anto JM, Garcia-Aymerich J. A framework for multiple imputation in cluster analysis. American Journal of Epidemiology. 2013;177(7):718-25.
Breaban M, Luchian H. A unifying criterion for unsupervised clustering and feature selection. Pattern Recognition. 2011;44(4):854-65.
See Also
getdata for data preparation before using miclust.
Examples
### data preparation:
minhanes1 <- getdata(data = minhanes)
##################
###
### Example 1:
###
### Multiple imputation clustering process with backward variable selection
###
##################
### using only the imputations 1 to 10 for the clustering process and exploring
### 2 vs. 3 clusters:
minhanes1clust <- miclust(data = minhanes1, search = "backward", ks = 2:3,
usedimp = 1:10, seed = 4321)
minhanes1clust
minhanes1clust$kfin ### optimal number of clusters
### graphical summary:
plot(minhanes1clust)
### selection frequency of the variables for the optimal number of clusters:
y <- getvariablesfrequency(minhanes1clust)
y
plot(y$percfreq, type = "h", main = "", xlab = "Variable",
ylab = "Percentage of times selected", xlim = 0.5 + c(0, length(y$varnames)),
lwd = 15, col = "blue", xaxt = "n")
axis(1, at = 1:length(y$varnames), labels = y$varnames)
### default summary for the optimal number of clusters:
summary(minhanes1clust)
## summary forcing 3 clusters:
summary(minhanes1clust, k = 3)
##################
###
### Example 2:
###
### Same analysis but without variable selection
###
##################
minhanes2clust <- miclust(data = minhanes1, ks = 2:3, usedimp = 1:10, seed = 4321)
minhanes2clust
plot(minhanes2clust)
summary(minhanes2clust)
##################
###
### Example 3:
###
### Complete case clustering process with backward variable selection
###
##################
nhanes0 <- getdata(data = minhanes[[1]])
nhanes2clust <- miclust(data = nhanes0, search = "backward", ks = 2:3, seed = 4321)
nhanes2clust
summary(nhanes2clust)
### nothing to plot for a single data set analysis
# plot(nhanes2clust)
##################
###
### Example 4:
###
### Complete case clustering process without variable selection
###
##################
nhanes3clust <- miclust(data = nhanes0, ks = 2:3, seed = 4321)
nhanes3clust
summary(nhanes3clust)