global_bootclus {clustrd}    R Documentation

Global stability assessment of Joint Dimension Reduction and Clustering methods by bootstrapping.

Description

Runs joint dimension reduction and clustering algorithms repeatedly, for different numbers of clusters, on bootstrap replicates of the original data. Returns the corresponding cluster assignments and the cluster agreement indices that compare pairs of partitions.

Usage

global_bootclus(data, nclusrange = 3:4, ndim = NULL, 
method = c("RKM","FKM","mixedRKM","mixedFKM","clusCA","MCAk","iFCB"), 
nboot = 10, alpha = NULL, alphak = NULL, center = TRUE, 
scale = TRUE, nstart = 100, smartStart = NULL, seed = NULL)

Arguments

data

Continuous, categorical, or mixed data set

nclusrange

An integer or an integer vector giving the number of clusters or a range of numbers of clusters; every value should be greater than one

ndim

Dimensionality of the solution; if NULL it is set to nclus - 1

method

Specifies the method. Options are RKM for Reduced K-means, FKM for Factorial K-means, mixedRKM for Mixed Reduced K-means, mixedFKM for Mixed Factorial K-means, MCAk for MCA K-means, iFCB for Iterative Factorial Clustering of Binary variables and clusCA for Cluster Correspondence Analysis.

nboot

Number of bootstrap pairs of partitions

alpha

Adjusts for the relative importance of (mixed) RKM and FKM in the objective function; alpha = 1 reduces to PCA/PCAMIX, alpha = 0.5 to (mixed) reduced K-means, and alpha = 0 to (mixed) factorial K-means

alphak

Non-negative scalar that adjusts the relative importance of MCA (alphak = 1) and K-means (alphak = 0) in the solution (default = 0.5). Used only with method = "MCAk"

center

A logical value indicating whether the metric variables should be shifted to be zero centered (default = TRUE)

scale

A logical value indicating whether the metric variables should be scaled to have unit variance before the analysis takes place (default = TRUE)

nstart

Number of random starts (default = 100)

smartStart

If NULL then a random cluster membership vector is generated. Alternatively, a cluster membership vector can be provided as a starting solution

seed

An integer passed to set.seed() to seed the random number generator when smartStart = NULL (default = NULL)

Details

The algorithm for assessing global cluster stability is similar to that in Dolnicar and Leisch (2010) and can be summarized in three steps:

Step 1. Resampling: Draw bootstrap samples S_i and T_i of size n from the data and use the original data, X, as evaluation set E_i = X. Apply the clustering method of choice to S_i and T_i and obtain C^S_i and C^T_i.

Step 2. Mapping: Assign each observation of X to the closest center of C^S_i and to the closest center of C^T_i using Euclidean distance. This yields the partitions C^XS_i and C^XT_i, where C^XS_i is the partition of the original data, X, predicted from clustering bootstrap sample S_i (and likewise C^XT_i for T_i).

Step 3. Evaluation: Use the Adjusted Rand Index (ARI, Hubert & Arabie, 1985) or the Measure of Concordance (MOC, Pfitzner et al., 2009) between C^XS_i and C^XT_i as a measure of agreement and stability.

Inspect the distributions of ARI/MOC to assess the global reproducibility of the clustering solutions.
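The three steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: plain kmeans() stands in for the joint dimension reduction and clustering methods, the built-in iris measurements stand in for X, and the helper names adj_rand, assign_nearest and stability_one_pair are hypothetical.

```r
## Illustrative stand-in only: kmeans() replaces the clustrd methods.
set.seed(1234)
X <- scale(as.matrix(iris[, 1:4]))   # evaluation set E = X

## Step 3 helper: Adjusted Rand Index computed from the contingency table
adj_rand <- function(a, b) {
  tab <- table(a, b)
  sum_ij <- sum(choose(tab, 2))
  sum_a <- sum(choose(rowSums(tab), 2))
  sum_b <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(length(a), 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

## Step 2 helper: assign each row of X to its nearest center (Euclidean)
assign_nearest <- function(X, centers) {
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums((X - matrix(centers[k, ], nrow(X), ncol(X), byrow = TRUE))^2)
  })
  max.col(-d, ties.method = "first")   # column index of the smallest distance
}

## One bootstrap pair (S_i, T_i) for a given number of clusters k
stability_one_pair <- function(X, k) {
  n <- nrow(X)
  idx_s <- sample(n, n, replace = TRUE)                        # Step 1: S_i
  idx_t <- sample(n, n, replace = TRUE)                        # Step 1: T_i
  cen_s <- kmeans(X[idx_s, ], centers = k, nstart = 5)$centers
  cen_t <- kmeans(X[idx_t, ], centers = k, nstart = 5)$centers
  adj_rand(assign_nearest(X, cen_s), assign_nearest(X, cen_t)) # Steps 2-3
}

nboot <- 10
ari <- sapply(2:4, function(k) replicate(nboot, stability_one_pair(X, k)))
colnames(ari) <- 2:4
print(summary(ari))   # one column of ARI values per number of clusters
```

Passing ari to boxplot() gives the same kind of display as the Examples section. Columns whose ARI values are high and tightly concentrated indicate more reproducible solutions.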

nboot = 100 is recommended, but fewer replicates can still give informative results when computation time becomes too high.

Note that only the stability of a clustering solution is assessed, and stability is not the only important validity criterion: clustering solutions obtained by very inflexible clustering methods may be stable but not valid, as discussed in Hennig (2007).

Value

nclusrange

An integer or an integer vector with the number of clusters or a range of numbers of clusters

clust1

Partitions C^XS_i of the original data, X, predicted from clustering bootstrap sample S_i (see Details)

clust2

Partitions C^XT_i of the original data, X, predicted from clustering bootstrap sample T_i (see Details)

index1

Indices of the original data rows in bootstrap sample S_i

index2

Indices of the original data rows in bootstrap sample T_i

rand

Adjusted Rand Index values

moc

Measure of Concordance values
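A short sketch of how the returned agreement values might be summarized. It assumes that rand is a matrix with one row per bootstrap pair and one column per number of clusters, which the boxplot() calls in Examples suggest but this page does not state. The matrix below is a made-up placeholder, not output of global_bootclus.

```r
## Hypothetical stand-in for res$rand: 5 bootstrap pairs (rows),
## numbers of clusters 3 and 4 (columns); values are invented for illustration.
rand <- cbind("3" = c(0.91, 0.88, 0.95, 0.90, 0.93),
              "4" = c(0.72, 0.65, 0.80, 0.70, 0.68))

med <- apply(rand, 2, median)       # typical agreement per number of clusters
iqr <- apply(rand, 2, IQR)          # spread of agreement per number of clusters
best <- names(med)[which.max(med)]  # number of clusters with highest median ARI

print(rbind(median = med, IQR = iqr))
print(best)
```

The median and spread are only one possible summary; the full distributions remain the primary diagnostic.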

References

Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193-218.

Hennig, C. (2007). Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.

Pfitzner, D., Leibbrandt, R., & Powers, D. (2009). Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems, 19(3), 361-394.

Dolnicar, S., & Leisch, F. (2010). Evaluation of structure and reproducibility of cluster solutions using the bootstrap. Marketing Letters, 21(1), 83-101.

See Also

local_bootclus

Examples

## 3 bootstrap replicates and nstart = 1 for speed in example,
## use at least 20 replicates for real applications
data(diamond)
boot_mixedRKM = global_bootclus(diamond[,-7], nclusrange = 3:4,
method = "mixedRKM", nboot = 3, nstart = 1, seed = 1234)

boxplot(boot_mixedRKM$rand, xlab = "Number of clusters", ylab =
"adjusted Rand Index")

## 5 bootstrap replicates and nstart = 10 for speed in example,
## use more for real applications
#data(macro)
#boot_RKM = global_bootclus(macro, nclusrange = 2:5,
#method = "RKM", nboot = 5, nstart = 10, seed = 1234)

#boxplot(boot_RKM$rand, xlab = "Number of clusters", ylab =
#"adjusted Rand Index")

## 5 bootstrap replicates and nstart = 1 for speed in example,
## use more for real applications
#data(bribery)
#boot_cluCA = global_bootclus(bribery, nclusrange = 2:5, 
#method = "clusCA", nboot = 5, nstart = 1, seed = 1234)

#boxplot(boot_cluCA$rand, xlab = "Number of clusters", ylab =
#"adjusted Rand Index")

[Package clustrd version 1.4.0 Index]