R: Selection of number of clusters via distance-based measures

cDistance {cstab}

R Documentation

Selection of number of clusters via distance-based measures

Description

Selects the number of clusters using the gap statistic, the jump statistic, and the slope statistic.

Usage

cDistance(data, kseq, method = "kmeans", linkage = "complete",
  kmIter = 10, gapIter = 10)

Arguments

`data`	an n x p numeric data matrix.
`kseq`	a vector of candidate numbers of clusters k, each greater than 1.
`method`	character string indicating the clustering algorithm. 'kmeans' for the k-means algorithm, 'hierarchical' for hierarchical clustering.
`linkage`	character string specifying the linkage criterion, used only when `method='hierarchical'`. The available options are "single", "complete", "average", "mcquitty", "ward.D", "ward.D2", "centroid" or "median". See hclust.
`kmIter`	integer specifying the number of restarts of the k-means algorithm, used to avoid local minima.
`gapIter`	integer specifying the number of simulated datasets used to compute the gap statistic (see Tibshirani et al., 2001).

Value

a list with the optimal number of clusters as determined by the gap statistic (Tibshirani et al., 2001), the jump statistic (Sugar & James, 2003) and the slope statistic (Fujita et al., 2014). In addition, the function returns the gap, jump and slope values for each k in kseq.
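As a rough illustration of what the gap value for each k measures, the following base-R sketch compares the log within-cluster sum of squares of the data with its average over uniform reference datasets. It is not the cstab implementation: the helper names `log_wk` and `uniform_reference` are hypothetical, `n_ref` only plays a role analogous to `gapIter`, and picking the k with the largest gap is a simplification of the selection rule in Tibshirani et al. (2001), which also uses the standard error of the reference values.

```r
# Minimal gap-statistic sketch in base R (illustration only, not cstab's code)
set.seed(1)

# Toy data: four well-separated Gaussian clusters in two dimensions
s <- 0.1
n <- 50
data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 0, s)))

# Log of the total within-cluster sum of squares of a k-means solution
# (hypothetical helper)
log_wk <- function(x, k, restarts = 10) {
  log(kmeans(x, centers = k, nstart = restarts)$tot.withinss)
}

# Reference dataset drawn uniformly over the range of each column
# (hypothetical helper)
uniform_reference <- function(x) {
  apply(x, 2, function(col) runif(length(col), min(col), max(col)))
}

kseq  <- 2:6
n_ref <- 10  # number of reference datasets (assumed analogous to gapIter)

# Gap(k) = mean log W_k over reference datasets - log W_k of the data
gap <- sapply(kseq, function(k) {
  ref <- replicate(n_ref, log_wk(uniform_reference(data), k))
  mean(ref) - log_wk(data, k)
})
names(gap) <- kseq

# Simplified selection: the k with the largest gap
k_best <- kseq[which.max(gap)]
print(gap)
print(k_best)
```

A larger gap means the data cluster more tightly at that k than structureless data with the same range would, which is why the gap values across `kseq` are worth inspecting alongside the selected k.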

Author(s)

Dirk U. Wulff <dirk.wulff@gmail.com>, Jonas M. B. Haslbeck <jonas.haslbeck@gmail.com>

References

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.

Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 750-763.

Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27-39.

Examples

## Not run: 
  # Generate Data from Gaussian Mixture
  s <- .1
  n <- 50
  data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
                cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
                cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
                cbind(rnorm(n, 1, s), rnorm(n, 0, s)))
  plot(data)

  # Selection of Number of Clusters using Distance-based Measures
  cDistance(data, kseq = 2:10)
 
## End(Not run)

[Package cstab version 0.2-2 Index]