purity {funtimes} | R Documentation |
Clustering Purity
Description
Calculate the purity of clustering results. For an example of its use, see Schaeffer et al. (2016).
Usage
purity(classes, clusters)
Arguments
classes | a vector with labels of true classes.
clusters | a vector with labels of assigned clusters for which purity is to be tested. Should be of the same length as classes.
Details
Following Manning et al. (2008), each cluster is assigned to the class which is most frequent in the cluster, then
$$\mathrm{Purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|,$$
where $\Omega = \{\omega_1, \ldots, \omega_K\}$ is the set of identified clusters and $C = \{c_1, \ldots, c_J\}$ is the set of classes. That is, within each class $c_j$, find the size of the most populous cluster among the unassigned clusters. Then, sum together the sizes found and divide by $N$, where $N$ = length(classes) = length(clusters).
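The formula can be evaluated directly from a contingency table. The snippet below is a minimal sketch of the textbook formula only, not the package's implementation: it places no one-to-one restriction on the assignment of clusters to classes. The label vectors are written out by hand to match the table shown in 'Examples', and the name textbook_purity is hypothetical.

```r
# Labels reconstructed by hand to match the Example 1 contingency table.
classes <- rep(c("A", "B", "C"), each = 5)
clusters <- c("b", "b", "b", "c", "e",   # members of class A
              "a", "d", "d", "e", "e",   # members of class B
              "a", "b", "b", "d", "d")   # members of class C

# Rows are clusters (omega_k), columns are classes (c_j).
tab <- table(clusters, classes)

# For each cluster take its largest overlap with any class,
# sum these maxima over clusters, and divide by N.
textbook_purity <- sum(apply(tab, 1, max)) / length(classes)
textbook_purity
```

For these labels the row maxima are 1, 3, 1, 2, 2, so the sketch gives 9/15 = 0.6. The sample output in 'Examples' reports 0.4666667 for the same table, which is consistent with the function matching each class to a different, previously unassigned cluster, as described in the text.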
If the maximum is not unique, the cluster is assigned to the class for which the second maximum is the smallest, to maximize the purity (see ‘Examples’).
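One possible reading of this matching and tie-breaking rule can be sketched as follows. This is an illustration under stated assumptions, not the code of purity(): the helper greedy_purity is hypothetical, classes are processed in table order, and a tie between clusters is broken by preferring the cluster whose largest overlap with the other classes is smallest. The contingency table is copied from 'Examples'.

```r
# Contingency table from Example 1 (rows: classes, columns: clusters).
tab <- matrix(c(0, 3, 1, 0, 1,
                1, 0, 0, 2, 2,
                1, 2, 0, 2, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(classes = c("A", "B", "C"),
                              clusters = c("a", "b", "c", "d", "e")))

# Hypothetical helper (an assumption, not part of funtimes).
greedy_purity <- function(tab) {
  free <- colnames(tab)      # clusters not yet assigned to a class
  chosen <- character(0)     # cluster picked for each class
  sizes <- numeric(0)        # overlap of each class with its cluster
  for (j in seq_len(nrow(tab))) {
    if (length(free) == 0) break
    counts <- tab[j, free]
    best <- free[counts == max(counts)]
    if (length(best) > 1 && nrow(tab) > 1) {
      # Tie: prefer the cluster that overlaps least with the other classes.
      second <- apply(tab[-j, best, drop = FALSE], 2, max)
      best <- best[which.min(second)]
    }
    best <- best[1]
    chosen <- c(chosen, best)
    sizes <- c(sizes, tab[j, best])
    free <- setdiff(free, best)
  }
  list(pur = sum(sizes) / sum(tab), chosen = chosen, sizes = unname(sizes))
}

res <- greedy_purity(tab)
res$pur      # (3 + 2 + 2) / 15
res$chosen   # clusters matched to classes A, B, C
```

Under these assumptions class B ties between clusters 'd' and 'e'; 'd' overlaps class C in two observations while 'e' overlaps any other class in at most one, so 'e' is taken and 'd' remains available for class C, matching the assignment shown in 'Examples'.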
The number of unique elements in classes and clusters may differ.
Value
A list with two elements:
pur | purity value.
out | table with the columns ClassLabels, ClusterLabels, and ClusterSize (see ‘Examples’).
Author(s)
Vyacheslav Lyubchich
References
Manning CD, Raghavan P, Schütze H (2008).
Introduction to Information Retrieval.
Cambridge University Press, New York.
Schaeffer ED, Testa JM, Gel YR, Lyubchich V (2016).
“On information criteria for dynamic spatio-temporal clustering.”
In Banerjee A, Ding W, Dy JG, Lyubchich V, Rhines A (eds.), The 6th International Workshop on Climate Informatics: CI2016, 5–8.
doi:10.5065/D6K072N6.
Examples
# Fix seed for reproducible simulations:
# RNGkind(sample.kind = "Rounding") #run this line to have same seed across R versions > R 3.6.0
set.seed(1)
##### Example 1
# Create some class and cluster labels:
classes <- rep(LETTERS[1:3], each = 5)
clusters <- sample(letters[1:5], length(classes), replace = TRUE)
# From the table below:
#  - cluster 'b' corresponds to class A;
#  - either of the clusters 'd' and 'e' could correspond to class B;
#    however, 'e' should be chosen, because cluster 'd' also
#    intersects strongly with class C. Thus,
#  - cluster 'd' corresponds to class C.
table(classes, clusters)
##        clusters
## classes a b c d e
##       A 0 3 1 0 1
##       B 1 0 0 2 2
##       C 1 2 0 2 0
# The function makes this choice automatically:
purity(classes, clusters)
# Sample output:
## $pur
## [1] 0.4666667
##
## $out
##   ClassLabels ClusterLabels ClusterSize
## 1           A             b           3
## 2           B             e           2
## 3           C             d           2
##### Example 2
# The labels can also be numeric:
classes <- rep(1:5, each = 3)
clusters <- sample(1:3, length(classes), replace = TRUE)
purity(classes, clusters)