kselection {kselection} | R Documentation |
Selection of K in K-means Clustering
Description
Selection of k in k-means clustering based on Pham et al. paper.
Usage
kselection(
x,
fun_cluster = stats::kmeans,
max_centers = 15,
k_threshold = 0.85,
progressBar = FALSE,
trace = FALSE,
parallel = FALSE,
...
)
Arguments
x |
numeric matrix of data, or an object that can be coerced to such a matrix. |
fun_cluster |
function to cluster by (e.g. |
max_centers |
maximum number of clusters for evaluation. |
k_threshold |
maximum value of |
progressBar |
show a progress bar. |
trace |
display a trace of the progress. |
parallel |
If set to true, use parallel |
... |
arguments to be passed to the kmeans method. |
Details
This function implements the method proposed by Pham, Dimov and Nguyen for
selecting the number of clusters for the K-means algorithm. In this method
a function f(K) is used to evaluate the quality of the resulting clustering
and help decide on the optimal value of K for each data set. The f(K)
function is defined as
f(K) = \left\{
\begin{array}{rl}
1 & \mbox{if $K = 1$} \\
\frac{S_K}{\alpha_K S_{K-1}} & \mbox{if $S_{K-1} \ne 0$, $\forall K >1$} \\
1 & \mbox{if $S_{K-1} = 0$, $\forall K >1$}
\end{array} \right.
where S_K is the sum of the distortions of all clusters and \alpha_K
is a weight factor defined as
\alpha_K = \left\{
\begin{array}{rl}
1 - \frac{3}{4 N_d} & \mbox{if $K = 2$ and $N_d > 1$} \\
\alpha_{K-1} + \frac{1 - \alpha_{K-1}}{6} & \mbox{if $K > 2$ and $N_d > 1$}
\end{array} \right.
where N_d is the number of dimensions of the data set.
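As a worked illustration of the weight factor, take a data set with N_d = 2 dimensions and start the recursion at K = 2, as in the paper by Pham et al. The choice N_d = 2 is only an example; the first few weights are then
\begin{array}{rl}
\alpha_2 &= 1 - \frac{3}{4 \cdot 2} = 0.625 \\
\alpha_3 &= \alpha_2 + \frac{1 - \alpha_2}{6} = 0.625 + \frac{0.375}{6} = 0.6875 \\
\alpha_4 &= \alpha_3 + \frac{1 - \alpha_3}{6} = 0.6875 + \frac{0.3125}{6} \approx 0.7396
\end{array}
so \alpha_K increases towards 1 as K grows.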
In this definition f(K) is the ratio of the real distortion to the
estimated distortion and decreases when there are areas of concentration in
the data distribution.
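The definitions above can be sketched directly in R. This is a minimal illustration, not the package's internal code: the helper names alpha_k and f_k are hypothetical, the recursion for \alpha_K is assumed to start at K = 2 as in Pham et al., and the distortion S_K is assumed to be the tot.withinss value returned by stats::kmeans.

```r
# Sketch of f(K) from Pham et al.; alpha_k() and f_k() are hypothetical
# helpers and are not part of the kselection package.

# Weight factor alpha_K for K >= 2 and n_dim > 1 dimensions.
alpha_k <- function(k, n_dim) {
  stopifnot(k >= 2, n_dim > 1)
  a <- 1 - 3 / (4 * n_dim)                        # base case, K = 2
  for (i in seq_len(k - 2)) a <- a + (1 - a) / 6  # recursion for K > 2
  a
}

# f(K) for K = 1..length(s), where s[K] is the total distortion S_K.
f_k <- function(s, n_dim) {
  f <- numeric(length(s))
  f[1] <- 1                                       # f(1) = 1 by definition
  for (k in seq_along(s)[-1]) {
    f[k] <- if (s[k - 1] != 0) {
      s[k] / (alpha_k(k, n_dim) * s[k - 1])
    } else {
      1                                           # S_{K-1} = 0
    }
  }
  f
}

# Usage: two well-separated groups in two dimensions.
set.seed(1)
dat <- rbind(matrix(rnorm(200, 2, 0.1), ncol = 2),
             matrix(rnorm(200, -2, 0.1), ncol = 2))
# Assumption: S_K is taken as kmeans()'s total within-cluster sum of squares.
s <- sapply(1:6, function(k) kmeans(dat, centers = k, nstart = 10)$tot.withinss)
round(f_k(s, ncol(dat)), 3)
```

For data like this, f(2) should come out well below the other values, because splitting the data into its two groups removes most of the distortion.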
The values of K that yield f(K) < 0.85 can be recommended for clustering.
If no value of K yields f(K) < 0.85, the data set cannot be considered to
contain clusters.
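The threshold rule can be sketched as follows. The helper recommend_k is hypothetical and not part of the package. Two choices in it are assumptions made for illustration only: picking the candidate with the smallest f(K) as the single best value, and falling back to K = 1 when no value is below the threshold.

```r
# Hypothetical helper, not part of the kselection package.
# f: vector of f(K) values for K = 1, 2, ..., length(f).
recommend_k <- function(f, threshold = 0.85) {
  candidates <- which(f < threshold)      # every K below the threshold
  if (length(candidates) == 0) {
    # Assumption: report K = 1 when no cluster structure is detected.
    return(list(candidates = integer(0), best = 1L))
  }
  # Assumption: among the candidates, prefer the K with the smallest f(K).
  list(candidates = candidates,
       best = candidates[which.min(f[candidates])])
}

# Made-up f(K) values for illustration only.
res <- recommend_k(c(1.00, 0.16, 0.95, 1.02, 0.80, 1.01))
res$candidates
res$best
```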
Value
an object with the f(K) results.
Author(s)
Daniel Rodriguez
References
D T Pham, S S Dimov, and C D Nguyen, "Selection of k in k-means clustering", Mechanical Engineering Science, 2004, pp. 103-119.
See Also
Examples
# Create a data set with two clusters
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)
# Execute the method
sol <- kselection(dat)
# Get the results
k <- num_clusters(sol) # optimal number of clusters
f_k <- get_f_k(sol) # the f(K) vector
# Plot the results
plot(sol)
## Not run:
# Parallel
library(doMC)
registerDoMC(cores = 4)
system.time(kselection(dat, max_centers = 50, nstart = 25))
system.time(kselection(dat, max_centers = 50, nstart = 25, parallel = TRUE))
## End(Not run)