R: Measures for Comparing Clusterings

mclustcomp {mclustcomp}

R Documentation

Measures for Comparing Clusterings

Description

Given two partitions or clusterings C_1 and C_2, `mclustcomp` returns comparison scores for a designated set of measures. The two label vectors must have the same length and be of either numeric or factor type. The measures currently fall into 3 categories according to the methodology behind each. See below for the taxonomy.
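The input requirements above can be checked before calling the function. The following is a minimal sketch in base R, assuming a hypothetical helper `as_label_codes` that is not part of the package. It relies on the fact that a partition is determined by which items share a label, not by the label values themselves, so numeric and factor labels can both be mapped to integer codes.

```r
# Hypothetical pre-check helper; NOT part of mclustcomp.
# Accepts a numeric or factor label vector and returns integer codes 1..k.
as_label_codes <- function(labels) {
  stopifnot(is.numeric(labels) || is.factor(labels))
  as.integer(factor(labels))
}

x <- factor(c("a", "b", "a", "c"))  # factor labels
y <- c(10, 20, 10, 10)              # numeric labels

stopifnot(length(x) == length(y))   # both vectors must label the same items
cx <- as_label_codes(x)
cy <- as_label_codes(y)
```

Here `cx` and `cy` describe the same partitions as `x` and `y`; whether the package performs a similar conversion internally is not stated in this page.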

Usage

mclustcomp(x, y, types = "all", tversky.param = list())

Arguments

`x`, `y`	vectors of clustering labels
`types`	`"all"` to return scores for every available measure; alternatively, a single measure name or a vector of measure names. See the category sections below for the list of available measures.
`tversky.param`	a list of parameters for the Tversky index: `alpha` and `beta` are the weight parameters, and `sym` is a logical where `FALSE` selects the original method and `TRUE` selects a revised variant that symmetrizes the score. The default is (alpha,beta)=(1,1).

Value

a data frame with two columns: the measure types and their corresponding scores.

Category 1. Counting Pairs

TYPE	FULL NAME
`'adjrand'`	Adjusted Rand index.
`'chisq'`	Chi-Squared Coefficient.
`'fmi'`	Fowlkes-Mallows index.
`'jaccard'`	Jaccard index.
`'mirkin'`	Mirkin Metric, or Equivalence Mismatch Distance.
`'overlap'`	Overlap Coefficient, or Szymkiewicz-Simpson coefficient.
`'pd'`	Partition Difference.
`'rand'`	Rand Index.
`'sdc'`	Sørensen–Dice Coefficient.
`'smc'`	Simple Matching Coefficient.
`'tanimoto'`	Tanimoto index.
`'tversky'`	Tversky index.
`'wallace1'`	Wallace Criterion Type 1.
`'wallace2'`	Wallace Criterion Type 2.

Note that the Tanimoto coefficient and Dice's coefficient are special cases of the Tversky index with (alpha,beta) = (1,1) and (0.5,0.5), respectively.
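Measures in this category are functions of how many item pairs the two clusterings treat alike. The sketch below, in base R, derives the four pair counts from the contingency table and computes the Rand and Jaccard indices from them. The helper names `pair_counts` and `tversky_pairs` are hypothetical, and applying Tversky's ratio model to pair counts is an assumption made for illustration; the package's own `'tversky'` and `'tanimoto'` scores may be defined or scaled differently.

```r
# Illustrative helpers (hypothetical names); NOT mclustcomp's internal code.
pair_counts <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)                               # contingency table
  n11 <- sum(choose(tab, 2))                       # pairs together in both
  together_x <- sum(choose(rowSums(tab), 2))       # pairs together in x
  together_y <- sum(choose(colSums(tab), 2))       # pairs together in y
  n10 <- together_x - n11                          # together in x only
  n01 <- together_y - n11                          # together in y only
  n00 <- choose(length(x), 2) - n11 - n10 - n01    # separated in both
  c(n11 = n11, n10 = n10, n01 = n01, n00 = n00)
}

# Tversky's ratio model applied to pair counts (assumed form).
tversky_pairs <- function(pc, alpha = 1, beta = 1) {
  unname(pc["n11"] / (pc["n11"] + alpha * pc["n10"] + beta * pc["n01"]))
}

x <- c(1, 1, 1, 2, 2, 2)
y <- c(1, 1, 2, 2, 3, 3)
pc <- pair_counts(x, y)

rand    <- unname((pc["n11"] + pc["n00"]) / sum(pc))
jaccard <- unname(pc["n11"] / (pc["n11"] + pc["n10"] + pc["n01"]))
dice    <- unname(2 * pc["n11"] / (2 * pc["n11"] + pc["n10"] + pc["n01"]))
```

Under this assumed form, `tversky_pairs(pc, 1, 1)` reduces to the ratio n11/(n11+n10+n01) and `tversky_pairs(pc, 0.5, 0.5)` reduces to the Dice form 2*n11/(2*n11+n10+n01), which mirrors the special cases noted above.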

Category 2. Set Overlaps/Matching

TYPE	FULL NAME
`'f'`	F-Measure.
`'mhm'`	Meila-Heckerman Measure.
`'mmm'`	Maximum-Match Measure.
`'vdm'`	Van Dongen Measure.
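
Measures in this category work from the overlaps between individual clusters of the two partitions. As one illustration, the sketch below computes the unnormalized Van Dongen measure in the form summarized by Wagner and Wagner (2007): 2n minus the sum of row-wise maximum overlaps minus the sum of column-wise maximum overlaps. The function name `van_dongen` is hypothetical, and the package's `'vdm'` score may apply a different scaling.

```r
# Illustrative sketch (hypothetical name); NOT mclustcomp's internal code.
van_dongen <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)                       # overlaps between clusters
  best_for_x <- sum(apply(tab, 1, max))    # best match for each cluster of x
  best_for_y <- sum(apply(tab, 2, max))    # best match for each cluster of y
  2 * length(x) - best_for_x - best_for_y
}

x <- c(1, 1, 1, 2, 2, 2)
y <- c(1, 1, 2, 2, 3, 3)
d_xy <- van_dongen(x, y)   # distance between two different clusterings
d_xx <- van_dongen(x, x)   # identical clusterings
```

In this form the value is 0 for identical clusterings and grows as the best-matching clusters overlap less.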

Category 3. Information Theory

TYPE	FULL NAME
`'jent'`	Joint Entropy.
`'mi'`	Mutual Information.
`'nmi1'`	Normalized Mutual Information by Strehl and Ghosh.
`'nmi2'`	Normalized Mutual Information by Fred and Jain.
`'nmi3'`	Normalized Mutual Information by Danon et al.
`'nvi'`	Normalized Variation of Information.
`'vi'`	Variation of Information.
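
The information-theoretic measures are built from the entropies of the two label distributions and their joint distribution. The sketch below, in base R, computes the joint entropy, the mutual information, and the variation of information from the contingency table using the standard identities I(X;Y) = H(X) + H(Y) - H(X,Y) and VI = H(X,Y) - I(X;Y). The helper names are hypothetical, and the natural logarithm is an assumption; the logarithm base used by the package is not stated in this page.

```r
# Illustrative helpers (hypothetical names); NOT mclustcomp's internal code.
# Shannon entropy of a probability vector, natural log (assumed base).
entropy <- function(p) {
  p <- p[p > 0]            # treat 0 * log(0) as 0
  -sum(p * log(p))
}

info_measures <- function(x, y) {
  stopifnot(length(x) == length(y))
  pxy <- table(x, y) / length(x)       # joint distribution of labels
  hx  <- entropy(rowSums(pxy))         # H(X)
  hy  <- entropy(colSums(pxy))         # H(Y)
  hxy <- entropy(as.vector(pxy))       # H(X,Y), joint entropy
  mi  <- hx + hy - hxy                 # mutual information
  c(jent = hxy, mi = mi, vi = hxy - mi)
}

x <- c(1, 1, 2, 2)
same  <- info_measures(x, x)                 # identical clusterings
indep <- info_measures(x, c(1, 2, 1, 2))     # labels carry no shared information
```

For identical clusterings the variation of information is 0 and the mutual information equals the entropy of either labeling; for the second pair the mutual information is 0.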

References

Strehl A, Ghosh J (2003). “Cluster Ensembles — a Knowledge Reuse Framework for Combining Multiple Partitions.” J. Mach. Learn. Res., 3, 583–617. ISSN 1532-4435.

Meilă M (2007). “Comparing clusterings—an information based distance.” Journal of Multivariate Analysis, 98(5), 873–895. ISSN 0047259X.

Meilă M (2003). “Comparing Clusterings by the Variation of Information.” In Goos G, Hartmanis J, van Leeuwen J, Schölkopf B, Warmuth MK (eds.), Learning Theory and Kernel Machines, volume 2777, 173–187. Springer Berlin Heidelberg, Berlin, Heidelberg. ISBN 978-3-540-40720-1, 978-3-540-45167-9.

Wagner S, Wagner D (2007). “Comparing Clusterings – An Overview.” Technical Report 2006-04, Department of Informatics.

Albatineh AN, Niewiadomska-Bugaj M, Mihalko D (2006). “On Similarity Indices and Correction for Chance Agreement.” Journal of Classification, 23(2), 301–313. ISSN 0176-4268, 1432-1343.

Mirkin B (2001). “Eleven Ways to Look at the Chi-Squared Coefficient for Contingency Tables.” The American Statistician, 55(2), 111–120. ISSN 0003-1305, 1537-2731.

Rand WM (1971). “Objective Criteria for the Evaluation of Clustering Methods.” Journal of the American Statistical Association, 66(336), 846. ISSN 01621459.

Kuncheva LI, Hadjitodorov ST (2004). “Using diversity in cluster ensembles.” In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, volume 2, 1214–1219. ISBN 978-0-7803-8567-2.

Fowlkes EB, Mallows CL (1983). “A Method for Comparing Two Hierarchical Clusterings.” Journal of the American Statistical Association, 78(383), 553–569. ISSN 0162-1459, 1537-274X.

Dongen S (2000). “Performance Criteria for Graph Clustering and Markov Cluster Experiments.” CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands.

Jaccard P (1912). “The Distribution of the Flora in the Alpine Zone.” New Phytologist, 11(2), 37–50. ISSN 0028-646X, 1469-8137.

Li T, Ogihara M, Ma S (2010). “On combining multiple clusterings: an overview and a new perspective.” Applied Intelligence, 33(2), 207–219. ISSN 0924-669X, 1573-7497.

Larsen B, Aone C (1999). “Fast and effective text mining using linear-time document clustering.” In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 16–22. ISBN 978-1-58113-143-7.

Meilă M, Heckerman D (2001). “An Experimental Comparison of Model-Based Clustering Methods.” Machine Learning, 42(1), 9–29. ISSN 1573-0565.

Cover TM, Thomas JA (2006). Elements of information theory, 2nd edition. Wiley-Interscience, Hoboken, N.J. ISBN 978-0-471-24195-9, OCLC: ocm59879802.

Fred ALN, Jain AK (2003). “Robust data clustering.” In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, II–128–II–133. ISBN 978-0-7695-1900-5.

Wallace DL (1983). “Comment.” Journal of the American Statistical Association, 78(383), 569–576. ISSN 0162-1459, 1537-274X.

Simpson GG (1943). “Mammals and the nature of continents.” American Journal of Science, 241, 1–31.

Dice LR (1945). “Measures of the Amount of Ecologic Association Between Species.” Ecology, 26(3), 297–302. ISSN 00129658.

Segaran T (2007). Programming collective intelligence: building smart web 2.0 applications, 1st edition. O'Reilly, Beijing ; Sebastopol [CA]. ISBN 978-0-596-52932-1, OCLC: ocn166886837.

Tversky A (1977). “Features of similarity.” Psychological Review, 84(4), 327–352. ISSN 0033-295X.

Danon L, Díaz-Guilera A, Duch J, Arenas A (2005). “Comparing community structure identification.” Journal of Statistical Mechanics: Theory and Experiment, 2005(09), P09008. ISSN 1742-5468.

Lancichinetti A, Fortunato S, Kertész J (2009). “Detecting the overlapping and hierarchical community structure in complex networks.” New Journal of Physics, 11(3), 033015. ISSN 1367-2630.

Examples

## example 1. compare two identical clusterings
library(mclustcomp)             # load the package
x = sample(1:5,20,replace=TRUE) # labels from 1 to 5, 20 elements
y = x                           # set two labels x and y equal
mclustcomp(x,y)                 # show all results

## example 2. selection of a few methods
z = sample(1:4,20,replace=TRUE)           # generate a non-trivial clustering
cmethods = c("jaccard","tanimoto","rand") # select 3 methods
mclustcomp(x,z,types=cmethods)            # test with the selected scores

## example 3. tversky.param
tparam = list()                           # create an empty list
tparam$alpha = 2
tparam$beta  = 3
tparam$sym   = TRUE
mclustcomp(x,z,types="tversky")           # default set as Tanimoto case.
mclustcomp(x,z,types="tversky",tversky.param=tparam)

[Package mclustcomp version 0.3.3 Index]