R: Dertermining and Visualizing the Optimal Number of Clusters

fviz_nbclust {factoextra}

R Documentation

Dertermining and Visualizing the Optimal Number of Clusters

Description

Partitioning methods, such as k-means clustering require the users to specify the number of clusters to be generated.

fviz_nbclust(): Dertemines and visualize the optimal number of clusters using different methods: within cluster sums of squares, average silhouette and gap statistics.
fviz_gap_stat(): Visualize the gap statistic generated by the function clusGap() [in cluster package]. The optimal number of clusters is specified using the "firstmax" method (?cluster::clustGap).

Usage

fviz_nbclust(
  x,
  FUNcluster = NULL,
  method = c("silhouette", "wss", "gap_stat"),
  diss = NULL,
  k.max = 10,
  nboot = 100,
  verbose = interactive(),
  barfill = "steelblue",
  barcolor = "steelblue",
  linecolor = "steelblue",
  print.summary = TRUE,
  ...
)

fviz_gap_stat(
  gap_stat,
  linecolor = "steelblue",
  maxSE = list(method = "firstSEmax", SE.factor = 1)
)

Arguments

`x`	numeric matrix or data frame. In the function fviz_nbclust(), x can be the results of the function NbClust().
`FUNcluster`	a partitioning function which accepts as first argument a (data) matrix like x, second argument, say k, k >= 2, the number of clusters desired, and returns a list with a component named cluster which contains the grouping of observations. Allowed values include: kmeans, cluster::pam, cluster::clara, cluster::fanny, hcut, etc. This argument is not required when x is an output of the function `NbClust::NbClust()`.
`method`	the method to be used for estimating the optimal number of clusters. Possible values are "silhouette" (for average silhouette width), "wss" (for total within sum of square) and "gap_stat" (for gap statistics).
`diss`	dist object as produced by dist(), i.e.: diss = dist(x, method = "euclidean"). Used to compute the average silhouette width of clusters, the within sum of square and hierarchical clustering. If NULL, dist(x) is computed with the default method = "euclidean"
`k.max`	the maximum number of clusters to consider, must be at least two.
`nboot`	integer, number of Monte Carlo ("bootstrap") samples. Used only for determining the number of clusters using gap statistic.
`verbose`	logical value. If TRUE, the result of progress is printed.
`barfill`, `barcolor`	fill color and outline color for bars
`linecolor`	color for lines
`print.summary`	logical value. If true, the optimal number of clusters are printed in fviz_nbclust().
`...`	optionally further arguments for FUNcluster()
`gap_stat`	an object of class "clusGap" returned by the function clusGap() [in cluster package]
`maxSE`	a list containing the parameters (method and SE.factor) for determining the location of the maximum of the gap statistic (Read the documentation ?cluster::maxSE). Allowed values for maxSE$method include: "globalmax": simply corresponds to the global maximum, i.e., is which.max(gap) "firstmax": gives the location of the first local maximum "Tibs2001SEmax": uses the criterion, Tibshirani et al (2001) proposed: "the smallest k such that gap(k) >= gap(k+1) - s_k+1". It's also possible to use "the smallest k such that gap(k) >= gap(k+1) - SE.factors_k+1" where SE.factor is a numeric value which can be 1 (default), 2, 3, etc. "firstSEmax": location of the first f() value which is not larger than the first local maximum minus SE.factor SE.f[], i.e, within an "f S.E." range of that maximum. see ?cluster::maxSE for more options

Value

fviz_nbclust, fviz_gap_stat: return a ggplot2

Author(s)

Alboukadel Kassambara alboukadel.kassambara@gmail.com

Examples

set.seed(123)

# Data preparation
# +++++++++++++++
data("iris")
head(iris)
# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])


# Optimal number of clusters in the data
# ++++++++++++++++++++++++++++++++++++++
# Examples are provided only for kmeans, but
# you can also use cluster::pam (for pam) or
#  hcut (for hierarchical clustering)
 
### Elbow method (look at the knee)
# Elbow method for kmeans
fviz_nbclust(iris.scaled, kmeans, method = "wss") +
geom_vline(xintercept = 3, linetype = 2)

# Average silhouette for kmeans
fviz_nbclust(iris.scaled, kmeans, method = "silhouette")

### Gap statistic
library(cluster)
set.seed(123)
# Compute gap statistic for kmeans
# we used B = 10 for demo. Recommended value is ~500
gap_stat <- clusGap(iris.scaled, FUN = kmeans, nstart = 25,
 K.max = 10, B = 10)
 print(gap_stat, method = "firstmax")
fviz_gap_stat(gap_stat)
 
# Gap statistic for hierarchical clustering
gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 10)
fviz_gap_stat(gap_stat)

[Package factoextra version 1.0.7 Index]

Dertermining and Visualizing the Optimal Number of Clusters

Description

Usage

Arguments

Value

Author(s)

See Also

Examples