check_avgSil {ulrb}    R Documentation

Check average Silhouette score index

Description

Calculates average Silhouette score for a given sample.

Usage

check_avgSil(
  data,
  sample_id = NULL,
  samples_col = "Sample",
  abundance_col = "Abundance",
  range = 3:10,
  with_plot = FALSE,
  ...
)

Arguments

data

A tibble with, at least, a column for Abundance and Sample. Additional columns are allowed.

sample_id

String with name of the sample to apply this function.

samples_col

String with name of column with sample names.

abundance_col

String with name of column with abundance values.

range

The range of values of k to test, default is from 3 to 10.

with_plot

If FALSE (default), returns a vector with the scores; if TRUE, returns a plot of the scores.

...

Extra arguments.

Details

The average Silhouette score index provides a sense of cluster definition and separation. It varies between -1 (complete cluster overlap) and 1 (no cluster overlap); the closer to 1, the better. Thus, the k value with the highest average Silhouette score is the best k. This is the standard metric used by the ulrb package to automate the choice of k in the functions suggest_k() and define_rb().

Note: The average Silhouette score is different from the common calculation of the Silhouette index, which provides a score for each observation in a clustering result. Just like the name says, we are taking the average of all silhouette scores obtained in a clustering result. In this way we can have a single, comparable value for each k we test.
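The distinction between the per-observation Silhouette index and its average can be sketched as follows. This is only an illustration: the abundance values are made up, and partitioning around medoids via cluster::pam() (listed under See Also) is assumed as the clustering step, which is not necessarily how check_avgSil() works internally.

```r
library(cluster)

# Hypothetical abundance values for a single sample (illustrative only)
abundance <- matrix(c(1, 1, 2, 3, 5, 8, 40, 45, 52, 300, 320, 350), ncol = 1)

# Cluster into k = 3 groups with partitioning around medoids
fit <- pam(abundance, k = 3)

# Classical Silhouette index: one width per observation
sil <- silhouette(fit)
per_observation <- sil[, "sil_width"]

# Average Silhouette score: a single value for the whole clustering
avg_sil <- mean(per_observation)
avg_sil
```

The same average is also stored by pam() in fit$silinfo$avg.width.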

Data input

This function takes a data.frame with, at minimum, a column for samples and a column for abundance; any number of other columns is allowed. It then filters for the specific sample that you want to analyze. You can also pre-filter for your sample, but you still need to provide the sample ID (sample_id), and the table always needs one column for samples and another for abundance (indicate their names with the arguments samples_col and abundance_col).
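As a rough illustration of the expected input, the sketch below builds a small table with non-default column names. The table, its values, and the column names station, counts and taxon are all hypothetical; only the argument names come from the Usage section. The call is guarded so that the sketch still runs where ulrb is not installed.

```r
# Hypothetical table: two samples, custom column names, one extra column
my_data <- data.frame(
  station = rep(c("S1", "S2"), each = 30),
  counts  = c(seq_len(30)^2, seq_len(30) * 3),
  taxon   = paste0("OTU_", 1:60)
)

# Pre-filtering is optional; sample_id must be provided either way
only_s1 <- my_data[my_data$station == "S1", ]

# Tell the function which columns hold sample names and abundance values
if (requireNamespace("ulrb", quietly = TRUE)) {
  ulrb::check_avgSil(my_data,
                     sample_id = "S1",
                     samples_col = "station",
                     abundance_col = "counts",
                     range = 3:5)
}
```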

Output options

The default option returns a vector with the average Silhouette score for each k. This is a simple output that can then be used for other analyses. However, we also provide the option to show a plot (set with_plot = TRUE) with the average Silhouette score for each k.

Note that this function does not plot the classical Silhouette plot of a clustering result. To do that particular plot, use the function plot_ulrb_silhouette() instead.

Explanation of average Silhouette score

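To calculate the Silhouette score for a single observation, let:

a be the mean distance between the observation and all other observations in the same cluster; and

b be the mean distance between the observation and all observations in the nearest neighboring cluster.

The Silhouette score (Sil) is given by: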

Sil = \frac{(b - a)}{\max(a, b)}

Once you have the Silhouette score for all observations in a clustering result, just take the simple mean and get the average Silhouette score.
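The calculation can be sketched by hand in base R. This assumes the standard definitions from Rousseeuw (1987): a is the mean distance from an observation to the other members of its own cluster, and b is the smallest mean distance from that observation to the members of any other cluster. The data, the cluster assignment, and the helper silhouette_one() are hypothetical.

```r
# Hypothetical 1-D data and a hand-made cluster assignment (illustrative only)
x        <- c(1, 2, 3, 10, 11, 12, 30, 31)
clusters <- c(1, 1, 1, 2, 2, 2, 3, 3)

d <- as.matrix(dist(x))  # pairwise distances between observations

# Silhouette score of observation i (assumes no single-member clusters)
silhouette_one <- function(i) {
  own <- clusters == clusters[i]
  own[i] <- FALSE                       # exclude the observation itself
  a <- mean(d[i, own])                  # mean distance within its own cluster
  others <- setdiff(unique(clusters), clusters[i])
  # mean distance to each other cluster; b is the smallest of these
  b <- min(sapply(others, function(cl) mean(d[i, clusters == cl])))
  (b - a) / max(a, b)
}

sil <- sapply(seq_along(x), silhouette_one)

# Average Silhouette score: the simple mean over all observations
avg_sil <- mean(sil)
avg_sil
```

For the first observation (x = 1), a = 1.5 and b = 10, so its score is 8.5 / 10 = 0.85.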

Silhouette score intuition

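From the above formula, Sil = \frac{(b - a)}{\max(a, b)}, it is clear that, for a given observation:

if a < b, the score is positive and approaches 1 as a becomes small relative to b (the observation sits well within its own cluster);

if a = b, the score is 0 (the observation lies on the border between two clusters);

if a > b, the score is negative (the observation is, on average, closer to another cluster than to its own).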
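Plugging a few illustrative values of a and b into the formula (the numbers are made up for the purpose of the example) shows how the sign of the score follows from which of the two distances is larger:

```latex
\begin{align*}
a = 1,\ b = 4 &: \quad Sil = \frac{(4 - 1)}{\max(1, 4)} = \frac{3}{4} = 0.75 \\
a = 2,\ b = 2 &: \quad Sil = \frac{(2 - 2)}{\max(2, 2)} = \frac{0}{2} = 0 \\
a = 4,\ b = 1 &: \quad Sil = \frac{(1 - 4)}{\max(4, 1)} = \frac{-3}{4} = -0.75
\end{align*}
```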

Average Silhouette score intuition

If we take the average of the Silhouette score obtained for each observation in a clustering result, then we have the ability to compare the overall success of that clustering with another clustering. Thus, if we compare the average Silhouette score across different k values, i.e. different number of clusters, we can select the k with highest average Silhouette score.
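A minimal sketch of this comparison across k values is shown below. It assumes made-up abundance values and uses cluster::pam() for the clustering step; it illustrates the selection idea only and is not the implementation of suggest_k() or check_avgSil().

```r
library(cluster)

# Hypothetical abundance values for a single sample (illustrative only)
abundance <- matrix(c(1, 1, 2, 3, 5, 8, 40, 45, 52, 300, 320, 350), ncol = 1)

# Candidate numbers of clusters
k_values <- 2:5

# Average Silhouette score of the clustering obtained for each k
avg_sil <- sapply(k_values, function(k) pam(abundance, k = k)$silinfo$avg.width)
names(avg_sil) <- k_values

# Select the k with the highest average Silhouette score
best_k <- k_values[which.max(avg_sil)]

avg_sil
best_k
```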

Value

Vector with average Silhouette score index for each pre-specified k.

References

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(C), 53–65.

See Also

define_rb(), suggest_k(), cluster::pam(), cluster::silhouette()

Examples

library(dplyr)
# Just scores
check_avgSil(nice_tidy, sample_id = "ERR2044662")

# To change range
check_avgSil(nice_tidy, sample_id = "ERR2044662", range = 4:11)

# To see a simple plot
check_avgSil(nice_tidy, sample_id = "ERR2044662", range = 4:11, with_plot = TRUE)


[Package ulrb version 0.1.5 Index]