cDistance {cstab} | R Documentation |
Selection of number of clusters via distance-based measures
Description
Selects the number of clusters using the gap statistic, the jump statistic, and the slope statistic.
Usage
cDistance(data, kseq, method = "kmeans", linkage = "complete",
kmIter = 10, gapIter = 10)
Arguments
data |
an n x p numeric data matrix. |
kseq |
a vector of candidate numbers of clusters, each k > 1. |
method |
character string indicating the clustering algorithm. 'kmeans' for the k-means algorithm, 'hierarchical' for hierarchical clustering. |
linkage |
character specifying the linkage criterion, in case method = 'hierarchical'. |
kmIter |
integer specifying the number of restarts of the k-means algorithm, used to avoid local minima. |
gapIter |
integer specifying the number of simulated datasets to compute the gap statistic (see Tibshirani et al., 2001). |
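As a rough illustration of what the method, linkage, and kmIter arguments refer to, the two clustering approaches can be run directly with base R. This is a minimal sketch under the assumption that restarts correspond to something like the nstart argument of kmeans() and that linkage corresponds to the method argument of hclust(); it is not taken from the cstab source, and the toy data and object names are made up for the example.

```r
# Illustrative only: base R counterparts of the two clustering options.
# The mapping to cDistance() internals is an assumption, not documented behavior.
set.seed(1)

# Toy data: two Gaussian clusters in two dimensions
x <- rbind(matrix(rnorm(100, 0, 0.1), ncol = 2),
           matrix(rnorm(100, 1, 0.1), ncol = 2))
k <- 2

# k-means with 10 random restarts; kmeans() keeps the best of the runs
km <- kmeans(x, centers = k, nstart = 10)

# Hierarchical clustering with complete linkage, cut into k groups
hc <- cutree(hclust(dist(x), method = "complete"), k = k)

# Cross-tabulate the two partitions
table(kmeans = km$cluster, hierarchical = hc)
```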
Value
a list with the optimal number of clusters as determined by the gap statistic
(Tibshirani et al., 2001), the jump statistic (Sugar & James, 2003), and the
slope statistic (Fujita et al., 2014). In addition, the function returns the
gap, jump, and slope values for each k in kseq.
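To make the gap values more concrete, the following base R sketch computes a simplified gap statistic in the spirit of Tibshirani et al. (2001): the log within-cluster dispersion of the observed data is compared with its average over uniformly simulated reference datasets. It is not the cstab implementation. The helper names (log_wk, sim_uniform, gap_sketch) are hypothetical, the reference distribution is assumed to be uniform over the bounding box of the data, and the standard-error rule from the original paper is omitted.

```r
# Simplified gap statistic with base R only (hypothetical helpers, not cstab code)
set.seed(1)

# Toy data: four Gaussian clusters in two dimensions
s <- 0.1
n <- 50
toy <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
             cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
             cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
             cbind(rnorm(n, 1, s), rnorm(n, 0, s)))

# Log of the total within-cluster sum of squares of a k-means solution
log_wk <- function(x, k, restarts = 10) {
  log(kmeans(x, centers = k, nstart = restarts)$tot.withinss)
}

# Reference dataset: each column drawn uniformly over the observed range
sim_uniform <- function(x) {
  apply(x, 2, function(col) runif(length(col), min(col), max(col)))
}

# Gap(k) = mean over reference datasets of log(W_k*) minus log(W_k)
gap_sketch <- function(x, kseq, n_ref = 10) {
  sapply(kseq, function(k) {
    ref <- replicate(n_ref, log_wk(sim_uniform(x), k))
    mean(ref) - log_wk(x, k)
  })
}

kseq <- 2:6
gaps <- gap_sketch(toy, kseq)
names(gaps) <- kseq
gaps
```

Here n_ref plays the role that gapIter plays in cDistance(): larger values give a more stable estimate of the reference dispersion at the cost of more k-means runs.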
Author(s)
Dirk U. Wulff <dirk.wulff@gmail.com> Jonas M. B. Haslbeck <jonas.haslbeck@gmail.com>
References
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.
Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 750-763.
Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27-39.
Examples
## Not run:
# Generate Data from Gaussian Mixture
s <- .1
n <- 50
data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 0, s)))
plot(data)
# Selection of Number of Clusters using Distance-based Measures
cDistance(data, kseq=2:10)
## End(Not run)