local_bootclus {clustrd} | R Documentation |
Cluster-wise stability assessment of Joint Dimension Reduction and Clustering methods by bootstrapping.
Description
Assessment of the cluster-wise stability of a joint dimension reduction and clustering method. The data are resampled by bootstrapping, and the Jaccard similarities of the original clusters to the most similar clusters in the resampled data are computed. The mean over these similarities is used as an index of the stability of a cluster. The method is similar to the one described in Hennig (2007).
Usage
local_bootclus(data, nclus, ndim = NULL,
method = c("RKM","FKM","mixedRKM","mixedFKM","clusCA","MCAk","iFCB"),
scale = TRUE, center= TRUE, alpha = NULL, nstart=100,
nboot=10, alphak = .5, seed = NULL)
Arguments
data |
Continuous, categorical or mixed data set |
nclus |
Number of clusters |
ndim |
Dimensionality of the solution |
method |
Specifies the method. Options are RKM for Reduced K-means, FKM for Factorial K-means, mixedRKM for Mixed Reduced K-means, mixedFKM for Mixed Factorial K-means, MCAk for MCA K-means, iFCB for Iterative Factorial Clustering of Binary variables and clusCA for Cluster Correspondence Analysis. |
scale |
A logical value indicating whether the metric variables should be scaled to have unit variance before the analysis takes place (default = TRUE) |
center |
A logical value indicating whether the metric variables should be shifted to be zero centered (default = TRUE) |
alpha |
Adjusts for the relative importance of (mixed) RKM and FKM in the objective function (default = NULL) |
nstart |
Number of random starts (default = 100) |
nboot |
Number of bootstrap pairs of partitions |
alphak |
Non-negative scalar to adjust for the relative importance of MCA and K-means in the solution (default = .5) |
seed |
An integer that is used as the argument of set.seed() to make the bootstrap results reproducible |
Details
The algorithm for assessing local cluster stability is similar to that in Hennig (2007) and can be summarized in three steps:
Step 1. Resampling: Draw two bootstrap samples S_i and T_i of size n from the data X and use the original data as evaluation set, E_i = X. Apply the joint dimension reduction and clustering method to S_i and T_i to obtain partitions C^S_i and C^T_i.
Step 2. Mapping: Assign each original observation x_j to the closest center of C^S_i and of C^T_i using Euclidean distance, resulting in partitions C^XS_i and C^XT_i of X.
Step 3. Evaluation: For each original cluster C_k, obtain its maximum Jaccard agreement with the clusters C^XS_i_k' and with the clusters C^XT_i_k' as the measure of agreement and stability, and average the two values.
Inspect the distributions of the maximum Jaccard coefficients to assess the cluster level (local) stability of the solution.
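The three steps above can be sketched in base R for a single bootstrap replicate. Here plain kmeans() stands in for the joint dimension reduction and clustering method, and the helper jaccard() is illustrative, not a package internal:

```r
## Minimal one-replicate sketch; kmeans() stands in for the joint
## dimension reduction and clustering method.
jaccard <- function(a, b) {
  ## Jaccard similarity of two index sets
  length(intersect(a, b)) / length(union(a, b))
}

set.seed(1)
X <- as.matrix(iris[, -5])
K <- 3
orig <- kmeans(X, K, nstart = 10)

## Step 1: draw a bootstrap sample S_i and cluster it
idx  <- sample(nrow(X), replace = TRUE)
fitS <- kmeans(X[idx, ], K, nstart = 10)

## Step 2: assign every original observation to its nearest bootstrap center
d      <- as.matrix(dist(rbind(fitS$centers, X)))[-(1:K), 1:K]
clustS <- max.col(-d)

## Step 3: maximum Jaccard agreement for each original cluster
maxjac <- sapply(1:K, function(k)
  max(sapply(1:K, function(kk)
    jaccard(which(orig$cluster == k), which(clustS == kk)))))
maxjac
```

The full procedure repeats this for nboot pairs of samples S_i, T_i and averages the agreements per cluster.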
Here are some guidelines for interpretation. Generally, a valid, stable cluster should yield a mean Jaccard similarity of 0.75 or more. Values between 0.6 and 0.75 indicate that a cluster reflects a pattern in the data, but exactly which points belong to it is highly doubtful. Clusters with mean Jaccard values below 0.6 should not be trusted. "Highly stable" clusters should yield mean Jaccard similarities of 0.85 or above.
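These thresholds can be encoded in a small helper for labelling clusters by their mean Jaccard value (the function name stability_label and the label strings are illustrative, not part of the package):

```r
## Label mean Jaccard similarities using the guideline thresholds above
## (stability_label is an illustrative helper, not a package function).
stability_label <- function(jac) {
  cut(jac,
      breaks = c(0, 0.6, 0.75, 0.85, 1),
      labels = c("unreliable", "doubtful pattern", "stable", "highly stable"),
      include.lowest = TRUE)
}

stability_label(c(0.55, 0.70, 0.80, 0.90))
## labels: unreliable, doubtful pattern, stable, highly stable
```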
While nboot = 100 is recommended, smaller numbers of bootstrap replicates can also give quite informative results if computation times become too high.
Note that this procedure assesses the stability of a cluster, but stability is not the only important validity criterion: clusters obtained by very inflexible clustering methods may be stable but not valid, as discussed in Hennig (2007).
Value
nclus |
An integer with the number of clusters |
clust1 |
Partitions C^XS_i of the original data X, predicted from the clustering of bootstrap sample S_i (see Details) |
clust2 |
Partitions C^XT_i of the original data X, predicted from the clustering of bootstrap sample T_i (see Details) |
index1 |
Indices of the original data rows in bootstrap sample S_i |
index2 |
Indices of the original data rows in bootstrap sample T_i |
Jaccard |
Mean Jaccard similarity values |
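As a note on index1 and index2: they record which original rows were drawn into each bootstrap sample. A minimal base-R sketch of how such an index arises and what it implies (not the package internals):

```r
## Bootstrap row indices: sampling n rows with replacement.
set.seed(1234)
n      <- nrow(iris)
index1 <- sample(n, replace = TRUE)  # rows drawn into bootstrap sample S_i
S      <- iris[index1, -5]           # the bootstrap sample itself

## On average only about 63.2% of the original rows appear in a
## bootstrap sample; the rest are duplicates of drawn rows.
length(unique(index1)) / n
```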
References
Hennig, C. (2007). Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.
Examples
## 5 bootstrap replicates and nstart = 1 for speed in example,
## use more for real applications
data(iris)
bootres = local_bootclus(iris[,-5], nclus = 3, ndim = 2,
method = "RKM", nboot = 5, nstart = 1, seed = 1234)
boxplot(bootres$Jaccard, xlab = "cluster number", ylab =
"Jaccard similarity")
## 5 bootstrap replicates and nstart = 10 for speed in example,
## use more for real applications
#data(diamond)
#bootres = local_bootclus(diamond[,-7], nclus = 4, ndim = 3,
#method = "mixedRKM", nboot = 5, nstart = 10, seed = 1234)
#boxplot(bootres$Jaccard, xlab = "cluster number", ylab =
#"Jaccard similarity")
## 10 bootstrap replicates and nstart = 1 for speed in example,
## use more for real applications
#data(bribery)
#bootres = local_bootclus(bribery, nclus = 5, ndim = 4,
#method = "clusCA", nboot = 10, nstart = 1, seed = 1234)
#boxplot(bootres$Jaccard, xlab = "cluster number", ylab =
#"Jaccard similarity")