R: Silhouette score

sil.score {bios2mds}

R Documentation

Silhouette score

Description

Computes silhouette scores for multiple runs of K-means clustering.

Usage

sil.score(mat, nb.clus = c(2:13), nb.run = 100, iter.max = 1000,
method = "euclidean")

Arguments

`mat`	a numeric matrix representing the coordinates of the elements.
`nb.clus`	a numeric vector indicating the range of the numbers of clusters. Default is c(2:13).
`nb.run`	a numeric value indicating the number of runs. Default is 100.
`iter.max`	a numeric value indicating the maximum number of iterations for K-means clustering. Default is 1000.
`method`	a string of characters to determine the distance measure. This should be one of "euclidean", "maximum", "manhattan", "canberra" or "binary". Default is "euclidean".

Details

Silhouettes are a general graphical aid for interpretation and validation of cluster analysis. This technique is available through the silhouette function (cluster package). In order to calculate silhouettes, two types of data are needed:

the collection of all pairwise distances between objects. These distances are obtained by applying the dist function, with argument method, to the coordinates of the elements in mat.
the partition obtained by the application of a clustering technique. In the sil.score context, the partition is obtained from the Kmeans function (amap package), called with argument method, and indicates the cluster to which each element is assigned.
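
As a minimal sketch of these two inputs, the snippet below builds a distance object with dist and a partition with K-means, then passes both to silhouette. It assumes the cluster package is installed and uses stats::kmeans as a stand-in for amap::Kmeans. The matrix toy.mat is made-up data, not part of bios2mds.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Made-up toy coordinates: two well-separated groups of 10 points in 2D
toy.mat <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
                 matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))

# 1) all pairwise distances between the elements
d <- dist(toy.mat, method = "euclidean")

# 2) a partition: one cluster label per element
#    (assumption: stats::kmeans stands in for amap::Kmeans used by sil.score)
part <- kmeans(toy.mat, centers = 2, iter.max = 100, nstart = 5)$cluster

# silhouette() takes the integer cluster labels and the distance object
sil <- silhouette(part, d)
head(sil[, "sil_width"])
```

The result is a matrix with one row per element; the sil_width column holds the silhouette value of each element.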

For each element, a silhouette value is calculated; it evaluates the degree of confidence in the assignment of the element to its cluster:

well-clustered elements have a score near 1,
poorly-clustered elements have a score near -1.
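
For reference, the silhouette value of Rousseeuw (1987) for an element i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other members of its own cluster and b(i) is the smallest mean distance from i to the members of any other cluster. The base R sketch below computes this directly from the definition on four made-up points. The helper name sil.by.hand is hypothetical and not part of bios2mds or cluster.

```r
# Hypothetical helper: silhouette values computed directly from the definition
sil.by.hand <- function(x, cl) {
  dm <- as.matrix(dist(x))
  sapply(seq_along(cl), function(i) {
    own <- setdiff(which(cl == cl[i]), i)
    if (length(own) == 0) return(0)  # convention for singleton clusters
    a <- mean(dm[i, own])            # mean distance within own cluster
    b <- min(sapply(setdiff(unique(cl), cl[i]),
                    function(k) mean(dm[i, cl == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
}

x  <- c(0, 1, 10, 11)  # made-up 1-D coordinates
cl <- c(1, 1, 2, 2)    # made-up partition
s  <- sil.by.hand(x, cl)
round(s, 3)
# 0.905 0.895 0.895 0.905
```

All four points sit close to their own cluster and far from the other one, so every value is near 1.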

Thus, silhouettes indicate which objects are well or poorly clustered. To summarize the results, for each cluster, the silhouette values can be displayed as an average silhouette width, which is the mean of the silhouette values of all the elements assigned to this cluster. Finally, the overall average silhouette width is the mean of the average silhouette widths of the different clusters.
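
The summary described above can be sketched in base R as follows. The silhouette values and cluster labels are made-up numbers for illustration, and the last step follows the definition given in this section (the mean of the per-cluster averages).

```r
# Made-up silhouette values and cluster labels for six elements
sil.width <- c(0.8, 0.6, 0.7, 0.2, 0.4, -0.1)
cluster   <- c(1,   1,   1,   2,   2,   2)

# average silhouette width of each cluster
clus.avg <- tapply(sil.width, cluster, mean)

# overall average silhouette width, as defined in this section:
# the mean of the per-cluster averages
overall <- mean(clus.avg)
```

Here cluster 1 has a much higher average width than cluster 2, which contains one element with a negative silhouette value.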

Silhouette values offer the advantage that they depend only on the partition of the elements. As a consequence, silhouettes can be used to compare the output of the same clustering algorithm applied to the same data but for different numbers of clusters. A range of numbers of clusters can be tested with the nb.clus argument. The optimal number of clusters is the one that maximizes the overall average silhouette width, which indicates that the clustering algorithm has found a strong clustering structure. However, for a given number of clusters, the cluster assignment obtained by different K-means runs can differ because the K-means procedure assigns random initial centroids for each run. It may therefore be necessary to run the K-means procedure several times, with the nb.run argument, to account for the uncertainty of the results. In that case, for each number of clusters, the mean of the overall average silhouette widths over the nb.run K-means runs is calculated. The maximum of this score gives the optimal number of clusters.
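
The procedure described above can be sketched as a loop over the numbers of clusters with repeated K-means runs inside. This is an illustration under stated assumptions, not the sil.score source code: it uses stats::kmeans in place of amap::Kmeans (so method only affects the distances), takes avg.width from the summary of the silhouette object as the overall width of one run, and requires the cluster package. The function name toy.sil.score and the data are hypothetical.

```r
library(cluster)  # provides silhouette()

# Hypothetical sketch of the averaging scheme; not the bios2mds implementation
toy.sil.score <- function(mat, nb.clus = 2:5, nb.run = 10, iter.max = 100,
                          method = "euclidean") {
  d <- dist(mat, method = method)
  scores <- sapply(nb.clus, function(k) {
    runs <- replicate(nb.run, {
      # each run starts from different random initial centroids
      part <- kmeans(mat, centers = k, iter.max = iter.max)$cluster
      summary(silhouette(part, d))$avg.width
    })
    mean(runs)  # mean over the nb.run runs for this number of clusters
  })
  names(scores) <- nb.clus
  scores
}

set.seed(42)
# made-up data: three separated groups of 15 points in 2D
toy.mat <- rbind(matrix(rnorm(30, 0, 0.3), ncol = 2),
                 matrix(rnorm(30, 4, 0.3), ncol = 2),
                 matrix(rnorm(30, 8, 0.3), ncol = 2))
scores <- toy.sil.score(toy.mat)

# number of clusters with the highest mean score
best.k <- as.integer(names(scores)[which.max(scores)])
```

Because individual runs can end in different partitions, the averaged score for a given number of clusters may be noticeably lower than the score of its best single run; increasing nb.run makes the averages more stable.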

Value

A named numeric vector representing the silhouette scores for each number of clusters.

Note

sil.score requires the Kmeans and silhouette functions from the amap and cluster packages, respectively.

Author(s)

Julien Pele

References

Rousseeuw PJ (1987) Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics, 20:53-65.

Lovmar L, Ahlford A, Jonsson M and Syvanen AC (2005) Silhouette scores for assessment of SNP genotype clusters. BMC Genomics, 6:35.

Brock G, Pihur V, Datta S and Datta S (2008) clValid: An R Package for Cluster Validation. Journal of Statistical Software, 25.

Examples

# calculating silhouette scores for K-means clustering of human GPCRs:
data(gpcr)
active <- gpcr$dif$sapiens.sapiens
mds <- mmds(active)
sil.score1 <- sil.score(mds$coord, nb.clus = c(2:10),
 nb.run = 100, iter.max = 100)
barplot(sil.score1)

[Package bios2mds version 1.2.3 Index]