rand {catsim} | R Documentation
Similarity Indices
Description
The Rand index, rand_index, computes the agreement between two different clusterings or partitions of the same set of objects. The inputs to the function should be binary or categorical and of the same length.
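As an illustration of the pair-counting idea behind the Rand index, the following base R sketch counts the pairs of objects on which two partitions agree (both place the pair together, or both place it apart) and divides by the total number of pairs. The name rand_sketch is hypothetical and this is not the catsim implementation; it builds n-by-n matrices, so it is only suitable for small inputs.

```r
# Hypothetical helper for illustration only; not the catsim implementation.
rand_sketch <- function(x, y) {
  x <- as.vector(x)
  y <- as.vector(y)
  stopifnot(length(x) == length(y))
  n <- length(x)
  same_x <- outer(x, x, "==")   # TRUE where x puts the pair in one group
  same_y <- outer(y, y, "==")   # TRUE where y puts the pair in one group
  agree <- same_x == same_y     # both together, or both apart
  pairs <- upper.tri(agree)     # count each unordered pair once
  sum(agree[pairs]) / choose(n, 2)
}

rand_sketch(c(1, 1, 2, 2), c(1, 1, 1, 2))  # 3 of 6 pairs agree: 0.5
rand_sketch(c(1, 1, 2, 2), c(2, 2, 1, 1))  # same grouping, new labels: 1
```

Because only the groupings matter, relabeling the clusters leaves the value unchanged.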
The adjusted Rand index, adj_rand, computes a version of the Rand index that is corrected for the probability of chance agreement between clusterings. Because the denominator of the adjusted Rand index can be zero, a small constant is added to both its numerator and denominator to ensure stability when the denominator is small or 0.
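The chance correction can be sketched from the contingency table of the two partitions, following the Hubert and Arabie (1985) form of the index. The name adj_rand_sketch is hypothetical, and the sketch deliberately omits the small stabilizing constant, since its value is not stated here; as a result it returns NaN in the zero-denominator case that adj_rand guards against.

```r
# Hypothetical helper for illustration only; not the catsim implementation.
# Omits the small stabilizing constant that adj_rand adds.
adj_rand_sketch <- function(x, y) {
  tab <- table(as.vector(x), as.vector(y))      # contingency table
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))              # pairs together in both
  sum_rows <- sum(choose(rowSums(tab), 2))      # pairs together in x
  sum_cols <- sum(choose(colSums(tab), 2))      # pairs together in y
  expected <- sum_rows * sum_cols / choose(n, 2)  # chance agreement
  maximum <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

adj_rand_sketch(c(1, 1, 2, 2), c(1, 1, 1, 2))  # 0: no better than chance
adj_rand_sketch(rep(1, 4), rep(1, 4))          # NaN: zero denominator
```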
Cohen's kappa, cohen_kappa, is an inter-rater agreement metric for two raters that corrects for the probability of chance agreement. Note that this measure differs from the Rand indices and the mutual information measures: those consider the similarity of the groupings of points, while Cohen's kappa considers how often the raters agree on individual points.
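To make the point-by-point nature of Cohen's kappa concrete, the sketch below computes it as (p_obs - p_exp) / (1 - p_exp), where p_obs is the observed proportion of matching labels and p_exp is the proportion expected from the marginal label frequencies. The name kappa_sketch is hypothetical and this is not the catsim implementation.

```r
# Hypothetical helper for illustration only; not the catsim implementation.
kappa_sketch <- function(x, y) {
  x <- as.vector(x)
  y <- as.vector(y)
  lev <- sort(unique(c(x, y)))                  # shared label set
  tab <- table(factor(x, levels = lev), factor(y, levels = lev))
  p <- tab / sum(tab)
  p_obs <- sum(diag(p))                         # raters give the same label
  p_exp <- sum(rowSums(p) * colSums(p))         # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

kappa_sketch(c(1, 1, 2, 2), c(1, 1, 1, 2))  # 0.5
kappa_sketch(c(1, 1, 2, 2), c(2, 2, 1, 1))  # -1: labels never match
```

The second call uses the same grouping of points under swapped labels. A grouping-based measure such as the Rand index treats those two inputs as identical partitions, whereas kappa reports complete disagreement.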
Like the Rand index, the mutual information computes the agreement between two different clusterings or partitions of the same set of objects. If H(X) is the entropy of some probability distribution X, then the mutual information of two distributions is I(X;Y) = -H(X,Y) + H(X) + H(Y).
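The entropy identity above can be evaluated directly from the empirical label frequencies. In the sketch below, entropy_sketch and mi_sketch are hypothetical names, and the natural logarithm is an assumption: the base of the logarithm is not stated here.

```r
# Hypothetical helpers for illustration only; not the catsim implementation.
# Assumes natural logarithms.
entropy_sketch <- function(counts) {
  counts <- as.numeric(counts)
  p <- counts[counts > 0] / sum(counts)   # drop empty cells: 0 * log(0) = 0
  -sum(p * log(p))
}

mi_sketch <- function(x, y) {
  tab <- table(as.vector(x), as.vector(y))
  h_x <- entropy_sketch(rowSums(tab))     # H(X)
  h_y <- entropy_sketch(colSums(tab))     # H(Y)
  h_xy <- entropy_sketch(tab)             # H(X,Y)
  -h_xy + h_x + h_y
}

mi_sketch(c(1, 1, 2, 2), c("a", "a", "b", "b"))  # log(2), about 0.693
mi_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))          # 0, up to rounding
```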
The normalized mutual information, normalized_mi, is defined here as 2I(X;Y)/(H(X)+H(Y)), but is set to 0 if both H(X) and H(Y) are 0.
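A minimal sketch of this normalization, including the zero-entropy special case, might look as follows. The name nmi_sketch is hypothetical and this is not the catsim implementation. The ratio does not depend on the base of the logarithm, so natural logarithms are used for convenience.

```r
# Hypothetical helper for illustration only; not the catsim implementation.
nmi_sketch <- function(x, y) {
  entropy <- function(counts) {
    counts <- as.numeric(counts)
    p <- counts[counts > 0] / sum(counts)
    -sum(p * log(p))
  }
  tab <- table(as.vector(x), as.vector(y))
  h_x <- entropy(rowSums(tab))
  h_y <- entropy(colSums(tab))
  h_xy <- entropy(tab)
  if (h_x == 0 && h_y == 0) {
    return(0)                       # both inputs constant: defined as 0
  }
  2 * (-h_xy + h_x + h_y) / (h_x + h_y)
}

nmi_sketch(c(1, 1, 2, 2), c("a", "a", "b", "b"))  # identical groupings: 1
nmi_sketch(rep(1, 4), rep(2, 4))                  # both constant: 0
```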
The adjusted mutual information, adjusted_mi, is a correction of the mutual information that accounts for the probability of chance agreement, in a manner similar to the adjusted Rand index or Cohen's kappa.
Usage
rand_index(x, y, na.rm = FALSE)
adj_rand(x, y, na.rm = FALSE)
cohen_kappa(x, y, na.rm = FALSE)
normalized_mi(x, y, na.rm = FALSE)
adjusted_mi(x, y, na.rm = FALSE)
Arguments
x, y | a numeric or factor vector or array
na.rm | whether to remove NA values
Value
The similarity index, which is between 0 and 1 for most of the options. The adjusted Rand index and Cohen's kappa can be negative, but are bounded above by 1.
References
W. M. Rand (1971). "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association. American Statistical Association. 66 (336): 846–850. doi: 10.2307/2284239
Lawrence Hubert and Phipps Arabie (1985). "Comparing partitions". Journal of Classification. 2 (1): 193–218. doi: 10.1007/BF01908075
Cohen, Jacob (1960). "A coefficient of agreement for nominal scales". Educational and Psychological Measurement. 20 (1): 37–46. doi: 10.1177/001316446002000104
Jaccard, Paul (1912). "The distribution of the flora in the alpine zone". New Phytologist. 11 (2): 37–50. doi: 10.1111/j.1469-8137.1912.tb05611.x
Nguyen Xuan Vinh, Julien Epps, and James Bailey (2010). "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance". Journal of Machine Learning Research. 11: 2837–2854. https://www.jmlr.org/papers/v11/vinh10a
Examples
x <- rep(0:5, 5)
y <- c(rep(0:5, 4), rep(0, 6))
# Simple Matching, or Accuracy
mean(x == y)
# Hamming distance
sum(x != y)
rand_index(x, y)
adj_rand(x, y)
cohen_kappa(x, y)
normalized_mi(x, y)
adjusted_mi(x, y)