get_clustering_stats {scclust}    R Documentation

Get clustering statistics

Description

get_clustering_stats calculates summary statistics of a clustering: cluster counts and sizes, and measures of the within-cluster distances.

Usage

get_clustering_stats(distances, clustering)

Arguments

distances

a distances object describing the distances between the data points in clustering.

clustering

a scclust object containing a non-empty clustering.

Details

The function reports the following measures:

num_data_points      total number of data points
num_assigned         number of points assigned to a cluster
num_clusters         number of clusters
min_cluster_size     size of the smallest cluster
max_cluster_size     size of the largest cluster
avg_cluster_size     average cluster size
sum_dists            sum of all within-cluster distances
min_dist             smallest within-cluster distance
max_dist             largest within-cluster distance
avg_min_dist         average of the clusters' smallest distances
avg_max_dist         average of the clusters' largest distances
avg_dist_weighted    average of the clusters' average distances, weighted by cluster size
avg_dist_unweighted  average of the clusters' average distances (unweighted)

Let d(i,j) denote the distance between data points i and j. Let c be a cluster containing the indices of points assigned to the cluster. Let

D(c) = \{d(i,j): i,j \in c \wedge i>j\}

be a function returning all within-cluster distances in c, with each pair of points counted once. Let C be the set of all clusters.

sum_dists is defined as:

\sum_{c\in C} \text{sum}(D(c))

min_dist is defined as:

\min_{c\in C} \min(D(c))

max_dist is defined as:

\max_{c\in C} \max(D(c))

avg_min_dist is defined as:

\sum_{c\in C} \frac{\min(D(c))}{|C|}

avg_max_dist is defined as:

\sum_{c\in C} \frac{\max(D(c))}{|C|}

Let:

AD(c) = \frac{\text{sum}(D(c))}{|D(c)|}

be the average within-cluster distance in cluster c.

avg_dist_weighted is defined as:

\sum_{c\in C} \frac{|c| AD(c)}{\text{num\_assigned}}

where num_assigned is the number of assigned data points (see above).

avg_dist_unweighted is defined as:

\sum_{c\in C} \frac{AD(c)}{|C|}
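The definitions above can be checked by hand with base R alone. The following is a minimal sketch, not the package's implementation: manual_clustering_stats is a hypothetical helper, the distances are assumed to be Euclidean and are computed with stats::dist, NA is assumed to mark an unassigned point, and every cluster is assumed to contain at least two points so that D(c) is non-empty. The data and labels are those from the Examples section.

```r
# Hypothetical helper, not part of scclust: recomputes the distance-based
# measures from a plain distance matrix and a vector of cluster labels.
manual_clustering_stats <- function(dist_mat, labels) {
  assigned <- !is.na(labels)                 # assumption: NA = unassigned
  clusters <- split(which(assigned), labels[assigned])

  # D(c): each within-cluster pair counted once (lower triangle, i > j)
  D <- lapply(clusters, function(idx) {
    sub <- dist_mat[idx, idx, drop = FALSE]
    sub[lower.tri(sub)]
  })

  sizes <- lengths(clusters)                 # |c| for each cluster
  AD    <- vapply(D, mean, numeric(1))       # sum(D(c)) / |D(c)|
  all_d <- unlist(D, use.names = FALSE)

  c(sum_dists           = sum(all_d),
    min_dist            = min(all_d),
    max_dist            = max(all_d),
    avg_min_dist        = mean(vapply(D, min, numeric(1))),
    avg_max_dist        = mean(vapply(D, max, numeric(1))),
    avg_dist_weighted   = sum(sizes * AD) / sum(sizes),  # sum(sizes) = num_assigned
    avg_dist_unweighted = mean(AD))
}

pts <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5,
                        0.6, 0.7, 0.8, 0.9, 1.0),
                  y = c(10, 9, 8, 7, 6,
                        10, 9, 8, 7, 6))
labels <- c("A", "A", "B", "C", "B",
            "C", "C", "A", "B", "B")

dist_mat <- as.matrix(dist(pts))             # Euclidean distances (assumed)
manual <- manual_clustering_stats(dist_mat, labels)
print(round(manual, 7))
```

Under these assumptions the seven values should agree, up to rounding, with the distance measures printed in the Examples section (for instance sum_dists of about 18.2013 and min_dist of 0.5).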

Value

Returns a list of class clustering_stats containing the statistics.

Examples

my_data_points <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5,
                                   0.6, 0.7, 0.8, 0.9, 1.0),
                             y = c(10, 9, 8, 7, 6,
                                   10, 9, 8, 7, 6))

my_distances <- distances(my_data_points)

my_scclust <- scclust(c("A", "A", "B", "C", "B",
                        "C", "C", "A", "B", "B"))

get_clustering_stats(my_distances, my_scclust)

# >                     Value
# > num_data_points     10.0000000
# > num_assigned        10.0000000
# > num_clusters         3.0000000
# > min_cluster_size     3.0000000
# > max_cluster_size     4.0000000
# > avg_cluster_size     3.3333333
# > sum_dists           18.2013097
# > min_dist             0.5000000
# > max_dist             3.0066593
# > avg_min_dist         0.8366584
# > avg_max_dist         2.4148611
# > avg_dist_weighted    1.5575594
# > avg_dist_unweighted  1.5847484


[Package scclust version 0.2.4 Index]