sil.score {bios2mds} | R Documentation |
Silhouette score
Description
Computes silhouette scores for multiple runs of K-means clustering.
Usage
sil.score(mat, nb.clus = c(2:13), nb.run = 100, iter.max = 1000,
method = "euclidean")
Arguments
mat |
a numeric matrix representing the coordinates of the elements. |
nb.clus |
a numeric vector indicating the range of the numbers of clusters. Default is c(2:13). |
nb.run |
a numeric value indicating the number of runs. Default is 100. |
iter.max |
a numeric value indicating the maximum number of iterations for K-means clustering. Default is 1000. |
method |
a string of characters to determine the distance measure. This should be one of "euclidean", "maximum", "manhattan", "canberra" or "binary". Default is "euclidean". |
Details
Silhouettes are a general graphical aid for the interpretation and validation of cluster analysis. This technique is available through the silhouette function (cluster package). In order to calculate silhouettes, two types of data are needed:
the collection of all distances between objects. These distances are obtained by applying the dist function to the coordinates of the elements in mat, with argument method.
the partition obtained by applying a clustering technique, which indicates the cluster to which each element is assigned. In the sil.score context, the partition is obtained from the Kmeans function (amap package), with argument method.
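The two inputs described above can be sketched as follows. This is a minimal illustration on made-up coordinates, not the sil.score source: kmeans from the stats package stands in for Kmeans from amap (an assumption made so that the sketch only needs the cluster package), and the toy matrix is hypothetical.

```r
# Sketch only: stats::kmeans stands in for amap::Kmeans here.
library(cluster)   # provides silhouette()

set.seed(1)
# Hypothetical toy coordinates: two groups of 10 points in 2 dimensions.
mat <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))

# 1. The collection of all pairwise distances between elements.
d <- dist(mat, method = "euclidean")

# 2. A partition: one cluster label per element.
km <- kmeans(mat, centers = 2, iter.max = 1000)

# Silhouette values computed from the partition and the distances.
sil <- silhouette(km$cluster, d)
head(sil[, "sil_width"])
```

The object returned by silhouette is a matrix with one row per element; the sil_width column holds the individual silhouette values.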
For each element, a silhouette value is calculated; it measures the degree of confidence in the assignment of the element:
well-clustered elements have a score near 1,
poorly-clustered elements have a score near -1.
Thus, silhouettes indicate which objects are well or poorly clustered. To summarize the results, for each cluster, the silhouette values can be displayed as an average silhouette width, which is the mean of the silhouette values of all the elements assigned to this cluster. Finally, the overall average silhouette width is the mean of the average silhouette widths of the different clusters.
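As a worked illustration of these quantities, the sketch below computes silhouette values by hand in base R, using the standard definition s(i) = (b - a) / max(a, b) from Rousseeuw (1987), and then averages them as described above. The helper sil_by_hand and the six toy points are hypothetical and only meant to make the definitions concrete.

```r
# Hypothetical helper, written for illustration only.
sil_by_hand <- function(mat, cluster, method = "euclidean") {
  d <- as.matrix(dist(mat, method = method))
  labels <- sort(unique(cluster))
  sapply(seq_len(nrow(mat)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)   # singleton cluster: s(i) = 0 by convention
    # a: mean distance to the other members of the element's own cluster
    # (d[i, i] is 0, so the sum is divided by the cluster size minus one)
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    others <- setdiff(labels, cluster[i])
    b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
}

# Toy data: two compact, well-separated groups of three points.
mat <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
cluster <- c(1, 1, 1, 2, 2, 2)

s <- sil_by_hand(mat, cluster)
cluster_means <- tapply(s, cluster, mean)   # average silhouette width per cluster
overall <- mean(cluster_means)              # mean of the cluster averages
```

Because both toy groups are compact and far apart, every value in s is close to 1.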
Silhouette values offer the advantage that they depend only on the partition of the elements. As a consequence, silhouettes can be used to compare the output of the same clustering algorithm applied to the same data but for different numbers of clusters. A range of numbers of clusters can be tested with the nb.clus argument. The optimal number of clusters is the one that maximizes the overall average silhouette width, meaning that the clustering algorithm reaches a strong clustering structure.
However, for a given number of clusters, the cluster assignments obtained by different K-means runs can differ because the K-means procedure assigns random initial centroids for each run. It may be necessary to run the K-means procedure several times, with the nb.run argument, to evaluate the uncertainty of the results. In that case, for each number of clusters, the mean of the overall average silhouette widths over the nb.run K-means runs is calculated. The maximum of this score gives the optimal number of clusters.
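The averaging over runs can be sketched as below. This is an assumption-laden illustration rather than the bios2mds implementation: mean_sil_sketch is a hypothetical helper, kmeans from the stats package stands in for Kmeans from amap (so method only affects the distances passed to silhouette), and the toy coordinates are made up.

```r
library(cluster)   # provides silhouette()

# Hypothetical helper, not the bios2mds source.
mean_sil_sketch <- function(mat, nb.clus = 2:5, nb.run = 10, iter.max = 100,
                            method = "euclidean") {
  d <- dist(mat, method = method)
  scores <- sapply(nb.clus, function(k) {
    runs <- replicate(nb.run, {
      # Each call draws new random initial centroids.
      km <- kmeans(mat, centers = k, iter.max = iter.max)
      sil <- silhouette(km$cluster, d)
      # Overall width for this run: mean of the per-cluster averages.
      mean(tapply(sil[, "sil_width"], sil[, "cluster"], mean))
    })
    mean(runs)   # mean over the nb.run runs
  })
  names(scores) <- nb.clus
  scores
}

set.seed(42)
# Hypothetical toy coordinates: three groups of 15 points in 2 dimensions.
mat <- rbind(matrix(rnorm(30, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(30, mean = 6, sd = 0.3), ncol = 2),
             cbind(rnorm(15, mean = 0, sd = 0.3), rnorm(15, mean = 6, sd = 0.3)))

scores <- mean_sil_sketch(mat)
names(which.max(scores))   # number of clusters with the highest mean score
```

The result has the same shape as the value described below: a numeric vector named by the number of clusters, from which which.max picks the candidate optimum.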
Value
A named numeric vector representing the silhouette scores for each number of clusters.
Note
sil.score requires the Kmeans and silhouette functions from the amap and cluster packages, respectively.
Author(s)
Julien Pele
References
Rousseeuw PJ (1987) Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics, 20:53-65.
Lovmar L, Ahlford A, Jonsson M and Syvanen AC (2005) Silhouette scores for assessment of SNP genotype clusters. BMC Genomics, 6:35.
Brock G, Pihur V, Datta S and Datta S (2008) clValid: An R Package for Cluster Validation. Journal of Statistical Software, 25.
See Also
connectivity and dunn functions from clValid package.
silhouette function from cluster package.
Examples
# calculating silhouette scores for K-means clustering of human GPCRs:
library(bios2mds)
data(gpcr)
active <- gpcr$dif$sapiens.sapiens
mds <- mmds(active)
# mean silhouette score over 100 K-means runs for 2 to 10 clusters
sil.score1 <- sil.score(mds$coord, nb.clus = c(2:10),
                        nb.run = 100, iter.max = 100)
barplot(sil.score1)