purity {funtimes} | R Documentation |
Clustering Purity
Description
Calculate the purity of the clustering results. For example, see Schaeffer et al. (2016).
Usage
purity(classes, clusters)
Arguments
classes |
a vector with labels of true classes. |
clusters |
a vector with labels of assigned clusters for which purity is to be tested. Should be of the same length as classes. |
Details
Following Manning et al. (2008), each cluster is assigned to the class which is most frequent in the cluster, then
Purity(\Omega,C) = \frac{1}{N}\sum_{k}\max_{j}|\omega_k\cap c_j|,
where \Omega=\{\omega_1,\ldots,\omega_K\} is the set of identified clusters and C=\{c_1,\ldots,c_J\} is the set of classes. That is, within each class j=1,\ldots,J find the size of the most populous cluster among the K-j+1 clusters that are still unassigned. Then, sum together the \min(K,J) sizes found and divide by N, where N = length(classes) = length(clusters).
If this maximum is not unique for some class j, that is, several clusters tie, the cluster whose second-largest overlap (with another class) is the smallest is chosen, to maximize the Purity (see ‘Examples’).
The number of unique elements in classes and clusters may differ.
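The displayed formula and the one-to-one matching described in the text can give different values. The sketch below is not the package's implementation; it computes both quantities for the contingency table from Example 1, typed in by hand. The helper best_match is hypothetical: it assumes each class is matched to at most one cluster (and vice versa) and uses an exhaustive search that is only practical for small tables.

```r
# Contingency table from Example 1 (rows: classes, columns: clusters),
# entered by hand so that the sketch does not depend on the RNG.
tab <- matrix(c(0, 3, 1, 0, 1,
                1, 0, 0, 2, 2,
                1, 2, 0, 2, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(classes  = c("A", "B", "C"),
                              clusters = c("a", "b", "c", "d", "e")))
N <- sum(tab)

# Displayed formula: for each cluster, take its largest overlap with any class.
purity_formula <- sum(apply(tab, 2, max)) / N

# Hypothetical helper: largest total overlap over all one-to-one
# class-to-cluster matchings, found by exhaustive recursion.
best_match <- function(tab) {
  if (nrow(tab) == 0 || ncol(tab) == 0) return(0)
  # Option 1: leave the first class unmatched.
  best <- best_match(tab[-1, , drop = FALSE])
  # Option 2: match the first class to cluster k and recurse on the rest.
  for (k in seq_len(ncol(tab))) {
    cand <- tab[1, k] + best_match(tab[-1, -k, drop = FALSE])
    if (cand > best) best <- cand
  }
  best
}
purity_matched <- best_match(tab) / N

purity_formula  # 0.6
purity_matched  # 0.4666667
```

On this table the cluster-wise formula gives 9/15 = 0.6, while the one-to-one matching gives 7/15, which is the value shown in the sample output under ‘Examples’.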
Value
A list with two elements:
pur |
purity value. |
out |
table with the columns ClassLabels, ClusterLabels, and ClusterSize (see ‘Examples’). |
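As a rough check of how the two elements relate, the sample output in ‘Examples’ suggests that pur equals the sum of the ClusterSize column divided by N. The sketch below retypes that sample table by hand instead of calling purity(), and assumes that out behaves like a data frame.

```r
# Sample `out` table copied by hand from the Example 1 output
# (assumption: `out` can be treated as a data frame).
out <- data.frame(ClassLabels   = c("A", "B", "C"),
                  ClusterLabels = c("b", "e", "d"),
                  ClusterSize   = c(3, 2, 2))
N <- 15  # length(classes) in Example 1

# Sum of the matched cluster sizes divided by the number of observations.
pur_from_out <- sum(out$ClusterSize) / N
pur_from_out  # 0.4666667
```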
Author(s)
Vyacheslav Lyubchich
References
Manning CD, Raghavan P, Schütze H (2008).
Introduction to Information Retrieval.
Cambridge University Press, New York.
Schaeffer ED, Testa JM, Gel YR, Lyubchich V (2016).
“On information criteria for dynamic spatio-temporal clustering.”
In Banerjee A, Ding W, Dy JG, Lyubchich V, Rhines A (eds.), The 6th International Workshop on Climate Informatics: CI2016, 5–8.
doi:10.5065/D6K072N6.
Examples
# Fix seed for reproducible simulations:
# RNGkind(sample.kind = "Rounding") #run this line to have same seed across R versions > R 3.6.0
set.seed(1)
##### Example 1
#Create some classes and cluster labels:
classes <- rep(LETTERS[1:3], each = 5)
clusters <- sample(letters[1:5], length(classes), replace = TRUE)
#From the table below:
# - cluster 'b' corresponds to class A;
# - either of the clusters 'd' and 'e' can correspond to class B;
# however, 'e' should be chosen, because cluster 'd' also intersects
# heavily with class C. Thus,
# - cluster 'd' corresponds to class C.
table(classes, clusters)
##       clusters
##classes a b c d e
##      A 0 3 1 0 1
##      B 1 0 0 2 2
##      C 1 2 0 2 0
#The function does this choice automatically:
purity(classes, clusters)
#Sample output:
##$pur
##[1] 0.4666667
##
##$out
##  ClassLabels ClusterLabels ClusterSize
##1           A             b           3
##2           B             e           2
##3           C             d           2
##### Example 2
#The labels can also be numeric:
classes <- rep(1:5, each = 3)
clusters <- sample(1:3, length(classes), replace = TRUE)
purity(classes, clusters)