randIndex {flexclust} | R Documentation
Compare Partitions
Description
Compute the (adjusted) Rand, Jaccard and Fowlkes-Mallows index for agreement of two partitions.
Usage
comPart(x, y, type=c("ARI","RI","J","FM"))
## S4 method for signature 'flexclust,flexclust'
comPart(x, y, type)
## S4 method for signature 'numeric,numeric'
comPart(x, y, type)
## S4 method for signature 'flexclust,numeric'
comPart(x, y, type)
## S4 method for signature 'numeric,flexclust'
comPart(x, y, type)
randIndex(x, y, correct=TRUE, original=!correct)
## S4 method for signature 'table,missing'
randIndex(x, y, correct=TRUE, original=!correct)
## S4 method for signature 'ANY,ANY'
randIndex(x, y, correct=TRUE, original=!correct)
Arguments
x
Either a 2-dimensional cross-tabulation of cluster
assignments (for
y
An object inheriting from class
type
Character vector of abbreviations of the indices to compute: "ARI" (adjusted Rand), "RI" (Rand), "J" (Jaccard), "FM" (Fowlkes-Mallows).
correct, original
Logical. If correct is TRUE, the Rand index is corrected for agreement by chance. original defaults to !correct.
Value
A vector of indices.
Rand Index
Let A denote the number of all pairs of data points which are either
put into the same cluster by both partitions or put into different
clusters by both partitions. Conversely, let D denote the number of all
pairs of data points that are put into one cluster in one partition,
but into different clusters by the other partition. The partitions
disagree for all pairs D and agree for all pairs A. We can measure the
agreement by the Rand index A/(A+D), which is invariant with respect to
permutations of cluster labels.
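The definition of A and D can be made concrete with a small base R sketch that enumerates all pairs of points. The helper rand_by_pairs is a hypothetical name used only for illustration; it is not part of flexclust and is practical only for short vectors, since it builds n x n matrices.

```r
## Hypothetical helper (not part of flexclust): uncorrected Rand index
## obtained by enumerating all unordered pairs of points.
rand_by_pairs <- function(g1, g2) {
  stopifnot(length(g1) == length(g2))
  ## TRUE where points i and j share a cluster in the given partition
  same1 <- outer(g1, g1, "==")
  same2 <- outer(g2, g2, "==")
  ## visit each unordered pair (i < j) exactly once
  pairs <- upper.tri(same1)
  A <- sum(same1[pairs] == same2[pairs])  # partitions agree on the pair
  D <- sum(same1[pairs] != same2[pairs])  # partitions disagree on the pair
  A / (A + D)
}

g1 <- c(1, 1, 2, 2, 3)
g2 <- c(1, 1, 2, 3, 3)
rand_by_pairs(g1, g2)                          # 8 / (8 + 2) = 0.8
## relabelling the clusters of g2 leaves the index unchanged
rand_by_pairs(g1, c("b", "b", "c", "a", "a"))  # 0.8
```

Of the 10 pairs, only (3,4) and (4,5) are treated differently by the two partitions, so A = 8 and D = 2.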
The index has to be corrected for agreement by chance if the sizes of the clusters are not uniform (which is usually the case) or if there are many clusters; see Hubert & Arabie (1985) for details.
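As a rough illustration of the chance correction, the following base R sketch computes the adjusted Rand index from a cross-tabulation, using the form in which the Hubert & Arabie (1985) correction is commonly stated. The helper adjusted_rand is a hypothetical name and is not claimed to mirror the internals of randIndex; it also does not guard against the degenerate case where the denominator is zero.

```r
## Hypothetical helper (not part of flexclust): adjusted Rand index from
## the cross-tabulation of two partitions.
adjusted_rand <- function(tab) {
  n_pairs   <- choose(sum(tab), 2)            # all pairs of points
  both      <- sum(choose(as.vector(tab), 2)) # together in both partitions
  row_pairs <- sum(choose(rowSums(tab), 2))   # together in partition 1
  col_pairs <- sum(choose(colSums(tab), 2))   # together in partition 2
  ## number of "together in both" pairs expected by chance
  expected  <- row_pairs * col_pairs / n_pairs
  (both - expected) / ((row_pairs + col_pairs) / 2 - expected)
}

g1 <- c(1, 1, 2, 2, 3)
g2 <- c(1, 1, 2, 3, 3)
adjusted_rand(table(g1, g2))  # (1 - 0.4) / (2 - 0.4) = 0.375
adjusted_rand(table(g1, g1))  # identical partitions give 1
```

For real data, the result can be compared against randIndex(table(g1, g2)).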
Jaccard Index
If the number of clusters is very large, then usually the vast
majority of pairs of points will not be in the same cluster. The
Jaccard index tries to account for this by using only pairs of points
that are in the same cluster in the definition of A.
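A minimal base R sketch of this idea follows, assuming that D keeps its meaning from the Rand index (pairs placed together by exactly one of the two partitions). The helper jaccard_by_pairs is a hypothetical name, not a flexclust function.

```r
## Hypothetical helper (not part of flexclust): Jaccard index obtained by
## enumerating all unordered pairs of points.
jaccard_by_pairs <- function(g1, g2) {
  stopifnot(length(g1) == length(g2))
  same1 <- outer(g1, g1, "==")
  same2 <- outer(g2, g2, "==")
  pairs <- upper.tri(same1)   # each unordered pair once
  s1 <- same1[pairs]
  s2 <- same2[pairs]
  A <- sum(s1 & s2)           # together in both partitions
  D <- sum(xor(s1, s2))       # together in exactly one partition
  A / (A + D)
}

g1 <- c(1, 1, 2, 2, 3)
g2 <- c(1, 1, 2, 3, 3)
jaccard_by_pairs(g1, g2)      # 1 / (1 + 2)
```

For these two vectors the uncorrected Rand index is 0.8, because the seven pairs that both partitions separate also count as agreement there; the Jaccard index ignores them.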
Fowlkes-Mallows
Let A again be the number of pairs of points that are in the same
cluster in both partitions. Fowlkes-Mallows divides this number by the
geometric mean of the sums of the number of pairs in each cluster of
the two partitions. This gives the probability that a pair of points
which are in the same cluster in one partition are also in the same
cluster in the other partition.
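The computation described above can be sketched in base R from the cross-tabulation of the two partitions. The helper fowlkes_mallows is a hypothetical name used for illustration only; it is not part of flexclust and does not handle the case where one partition has no within-cluster pairs.

```r
## Hypothetical helper (not part of flexclust): Fowlkes-Mallows index
## from the cross-tabulation of two partitions.
fowlkes_mallows <- function(tab) {
  both      <- sum(choose(as.vector(tab), 2)) # A: together in both partitions
  row_pairs <- sum(choose(rowSums(tab), 2))   # within-cluster pairs, partition 1
  col_pairs <- sum(choose(colSums(tab), 2))   # within-cluster pairs, partition 2
  ## divide by the geometric mean of the two within-cluster pair counts
  both / sqrt(row_pairs * col_pairs)
}

g1 <- c(1, 1, 2, 2, 3)
g2 <- c(1, 1, 2, 3, 3)
fowlkes_mallows(table(g1, g2))  # 1 / sqrt(2 * 2) = 0.5
```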
Author(s)
Friedrich Leisch
References
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2, 193–218, 1985.
Marina Meila. Comparing clusterings - an axiomatic view. In Stefan Wrobel and Luc De Raedt, editors, Proceedings of the International Machine Learning Conference (ICML). ACM Press, 2005.
Examples
## no class correlations: corrected Rand almost zero
g1 <- sample(1:5, size=1000, replace=TRUE)
g2 <- sample(1:5, size=1000, replace=TRUE)
tab <- table(g1, g2)
randIndex(tab)
## uncorrected version will be large, because there are many pairs of
## points which are assigned to different clusters in both partitions
randIndex(tab, correct=FALSE)
comPart(g1, g2)
## let pairs (g1=1,g2=1) and (g1=3,g2=3) agree better
k <- sample(1:1000, size=200)
g1[k] <- 1
g2[k] <- 1
k <- sample(1:1000, size=200)
g1[k] <- 3
g2[k] <- 3
tab <- table(g1, g2)
## the index should be larger than before
randIndex(tab, correct=TRUE, original=TRUE)
comPart(g1, g2)