cl_dissimilarity {clue} | R Documentation |
Dissimilarity Between Partitions or Hierarchies
Description
Compute the dissimilarity between (ensembles) of partitions or hierarchies.
Usage
cl_dissimilarity(x, y = NULL, method = "euclidean", ...)
Arguments
x |
an ensemble of partitions or hierarchies and dissimilarities,
or something coercible to that (see |
y |
|
method |
a character string specifying one of the built-in
methods for computing dissimilarity, or a function to be taken as
a user-defined method. If a character string, its lower-cased
version is matched against the lower-cased names of the available
built-in methods using |
... |
further arguments to be passed to methods. |
Details
If y
is given, its components must be of the same kind as those
of x
(i.e., components must either all be partitions, or all be
hierarchies or dissimilarities).
If all components are partitions, the following built-in methods for
measuring dissimilarity between two partitions with respective
membership matrices u
and v
(brought to a common number of
columns) are available:
"euclidean"
the Euclidean dissimilarity of the memberships, i.e., the square root of the minimal sum of the squared differences of
u
and all column permutations ofv
. See Dimitriadou, Weingessel and Hornik (2002)."manhattan"
the Manhattan dissimilarity of the memberships, i.e., the minimal sum of the absolute differences of
u
and all column permutations ofv
."comemberships"
the Euclidean dissimilarity of the elements of the co-membership matrices
C(u) = u u'
andC(v)
, i.e., the square root of the sum of the squared differences ofC(u)
andC(v)
."symdiff"
the cardinality of the symmetric set difference of the sets of co-classified pairs of distinct objects in the partitions. I.e., the number of distinct pairs of objects in the same class in exactly one of the partitions. (Alternatively, the cardinality of the symmetric set difference between the (binary) equivalence relations corresponding to the partitions.) For soft partitions, (currently) the symmetric set difference of the corresponding nearest hard partitions is used.
"Rand"
the Rand distance, i.e., the rate of distinct pairs of objects in the same class in exactly one of the partitions. (Related to the Rand index
a
via the linear transformationd = (1 - a) / 2
.) For soft partitions, (currently) the Rand distance of the corresponding nearest hard partitions is used."GV1"
the square root of the dissimilarity
\Delta_1
used for the first model in Gordon and Vichi (2001), i.e., the square root of the minimal sum of the squared differences of the matched non-zero columns ofu
andv
."BA/d"
distance measures for hard partitions discussed in Boorman and Arabie (1972), with d one of ‘A’, ‘C’, ‘D’, or ‘E’. For soft partitions, the distances of the corresponding nearest hard partitions are used.
"BA/A"
is the minimum number of single element moves (move from one class to another or a new one) needed to transform one partition into the other. Introduced in Rubin (1967)."BA/C"
is the minimum number of lattice moves for transforming one partition into the other, where partitions are said to be connected by a lattice move if one is just finer than the other (i.e., there is no other partition between them) in the partition lattice (seecl_meet
). Equivalently, withz
the join ofx
andy
andS
giving the number of classes, this can be written asS(x) + S(y) - 2 S(z)
. Attributed to David Pavy."BA/D"
is the “pair-bonds” distance, which can be defined asS(x) + S(y) - 2 S(z)
, withz
the meet ofx
andy
andS
the supervaluation (i.e., non-decreasing with respect to the partial order on the partition lattice) function\sum_i (n_i (n_i - 1)) / (n (n - 1))
, where then_i
are the numbers of objects in the respective classes of the partition (such thatn_i (n_i - 1) / 2
are the numbers of pair bonds in the classes), andn
the total number of objects."BA/E"
is the normalized information distance, defined as1 - I / H
, whereI
is the average mutual information between the partitions, andH
is the average entropy of the meetz
of the partitions. Introduced in Rajski (1961).(Boorman and Arabie also discuss a distance measure (
B
) based on the minimum number of set moves needed to transform one partition into the other, which, differently from theA
andC
distance measures is hard to compute (Day, 1981) and (currently) not provided.)"VI"
Variation of Information, see Meila (2003). If
...
has an argument namedweights
, it is taken to specify case weights."Mallows"
the Mallows-type distance by Zhou, Li and Zha (2005), which is related to the Monge-Kantorovich mass transfer problem, and given as the
p
-th root of the minimal value of the transportation problem\sum w_{jk} \sum_i |u_{ij} - v_{ik}| ^ p
with constraintsw_{jk} \ge 0
,\sum_j w_{jk} = \alpha_j
,\sum_k w_{jk} = \beta_k
, where\sum_j \alpha_j = \sum_k \beta_k
. The parametersp
,\alpha
and\beta
all default to one (in this case, the Mallows distance coincides with the Manhattan dissimilarity), and can be specified via additional arguments namedp
,alpha
, andbeta
, respectively."CSSD"
the Cluster Similarity Sensitive Distance of Zhou, Li and Zha (2005), which is given as the minimal value of
\sum_{k,l} (1 - 2 w_{kl} / (\alpha_k + \beta_l)) L_{kl}
, whereL_{kl} = \sum_i u_{ik} v_{il} d(p_{x;k}, p_{y;l})
withp_{x;k}
andp_{y;l}
the prototype of thek
-th class ofx
and thel
-th class ofy
, respectively,d
is the distance between these, and thew_{kl}
as for Mallows distance. If prototypes are matrices, the Euclidean distance between these is used as default. Using the additional argumentL
, one can give a matrix ofL_{kl}
values, or the functiond
. Parameters\alpha
and\beta
all default to one, and can be specified via additional arguments namedalpha
andbeta
, respectively.
For hard partitions, both Manhattan and squared Euclidean
dissimilarity give twice the transfer distance (Charon et al.,
2005), which is the minimum number of objects that must be removed so
that the implied partitions (restrictions to the remaining objects)
are identical. This is also known as the R
-metric in Day
(1981), i.e., the number of augmentations and removals of single
objects needed to transform one partition into the other, and the
partition-distance in Gusfield (2002), and equals twice the
number of single element moves distance of Boorman and Arabie.
For hard partitions, the pair-bonds (Boorman-Arabie D
) distance
is identical to the Rand distance, and can also be written as the
Manhattan distance between the co-membership matrices corresponding to
the partitions, or equivalently, their symdiff distance, normalized by
n (n - 1)
.
If all components are hierarchies, available built-in methods for
measuring dissimilarity between two hierarchies with respective
ultrametrics u
and v
are as follows.
"euclidean"
the Euclidean dissimilarity of the ultrametrics (i.e., the square root of the sum of the squared differences of
u
andv
)."manhattan"
the Manhattan dissimilarity of the ultrametrics (i.e., the sum of the absolute differences of
u
andv
)."cophenetic"
1 - c^2
, wherec
is the cophenetic correlation coefficient (i.e., the product-moment correlation of the ultrametrics)."gamma"
the rate of inversions between the ultrametrics (i.e., the rate of pairs
(i,j)
and(k,l)
for whichu_{ij} < u_{kl}
andv_{ij} > v_{kl}
)."symdiff"
the cardinality of the symmetric set difference of the sets of classes (hierarchies in the strict sense) induced by the dendrograms. I.e., the number of sets of objects obtained by a split in exactly one of the hierarchies.
"Chebyshev"
the Chebyshev (maximal) dissimilarity of the ultrametrics (i.e., the maximum of the absolute differences of
u
andv
)."Lyapunov"
the logarithm of the product of the maximal and minimal ratios of the ultrametrics. This is also known as the “Hilbert projective metric” on the cone represented by the ultrametrics (e.g., Jardine & Sibson (1971), page 107), and only defined for strict ultrametrics (which are strictly positive for distinct objects).
"BO"
the
m_\delta
family of tree metrics by Boorman and Olivier (1973), which are of the formm_\delta = \int_0^\infty \delta(p(h), q(h)) dh
, wherep(h)
andq(h)
are the hard partitions obtaining by cutting the trees (dendrograms) at heighth
, and\delta
is a suitably dissimilarity measure for partitions. In particular, when taking\delta
as symdiff or Rand dissimilarity,m_\delta
is the Manhattan dissimilarity of the hierarchies.If
...
has an argument nameddelta
it is taken to specify the partition dissimilarity\delta
to be employed."spectral"
the spectral norm (2-norm) of the differences of the ultrametrics, suggested in Mérigot, Durbec, and Gaertner (2010).
The measures based on ultrametrics also allow computing dissimilarity
with “raw” dissimilarities on the underlying objects (R objects
inheriting from class "dist"
).
If a user-defined dissimilarity method is to be employed, it must be a function taking two clusterings as its arguments.
Symmetric dissimilarity objects of class "cl_dissimilarity"
are
implemented as symmetric proximity objects with self-proximities
identical to zero, and inherit from class "cl_proximity"
. They
can be coerced to dense square matrices using as.matrix
. It
is possible to use 2-index matrix-style subscripting for such objects;
unless this uses identical row and column indices, this results in a
(non-symmetric dissimilarity) object of class
"cl_cross_dissimilarity"
.
Symmetric dissimilarity objects also inherit from class
"dist"
(although they currently do not “strictly”
extend this class), thus making it possible to use them directly for
clustering algorithms based on dissimilarity matrices of this class,
see the examples.
Value
If y
is NULL
, an object of class
"cl_dissimilarity"
containing the dissimilarities between all
pairs of components of x
. Otherwise, an object of class
"cl_cross_dissimilarity"
with the dissimilarities between the
components of x
and the components of y
.
References
S. A. Boorman and P. Arabie (1972). Structural measures and the method of sorting. In R. N. Shepard, A. K. Romney, & S. B. Nerlove (eds.), Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, 1: Theory (pages 225–249). New York: Seminar Press.
S. A. Boorman and D. C. Olivier (1973). Metrics on spaces of finite trees. Journal of Mathematical Psychology, 10, 26–59. doi:10.1016/0022-2496(73)90003-5.
I. Charon, L. Denoeud, A. Guénoche and O. Hudry (2006). Maximum Transfer Distance Between Partitions. Journal of Classification, 23, 103–121. doi:10.1007/s00357-006-0006-2.
W. E. H. Day (1981). The complexity of computing metric distances between partitions. Mathematical Social Sciences, 1, 269–287. doi:10.1016/0165-4896(81)90042-1.
E. Dimitriadou, A. Weingessel and K. Hornik (2002).
A combination scheme for fuzzy clustering.
International Journal of Pattern Recognition and Artificial
Intelligence, 16, 901–912.
doi:10.1142/S0218001402002052.
A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.
D. Gusfield (2002). Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82, 159–164. doi:10.1016/S0020-0190(01)00263-0.
N. Jardine and E. Sibson (1971). Mathematical Taxonomy. London: Wiley.
M. Meila (2003). Comparing clusterings by the variation of information. In B. Schölkopf and M. K. Warmuth (eds.), Learning Theory and Kernel Machines, pages 173–187. Springer-Verlag: Lecture Notes in Computer Science 2777.
B. Mérigot, J.-P. Durbec and J.-C. Gaertner (2010). On goodness-of-fit measure for dendrogram-based analyses. Ecology, 91, 1850—-1859. doi:10.1890/09-1387.1.
C. Rajski (1961). A metric space of discrete probability distributions, Information and Control, 4, 371–377. doi:10.1016/S0019-9958(61)80055-7.
J. Rubin (1967). Optimal classification into groups: An approach for solving the taxonomy problem. Journal of Theoretical Biology, 15, 103–144. doi:10.1016/0022-5193(67)90046-X.
D. Zhou, J. Li and H. Zha (2005). A new Mallows distance based metric for comparing clusterings. In Proceedings of the 22nd international Conference on Machine Learning (Bonn, Germany, August 07–11, 2005), pages 1028–1035. ICML '05, volume 119. ACM Press, New York, NY. doi:10.1145/1102351.1102481.
See Also
Examples
## An ensemble of partitions.
data("CKME")
pens <- CKME[1 : 30]
diss <- cl_dissimilarity(pens)
summary(c(diss))
cl_dissimilarity(pens[1:5], pens[6:7])
## Equivalently, using subscripting.
diss[1:5, 6:7]
## Can use the dissimilarities for "secondary" clustering
## (e.g. obtaining hierarchies of partitions):
hc <- hclust(diss)
plot(hc)
## Example from Boorman and Arabie (1972).
P1 <- as.cl_partition(c(1, 2, 2, 2, 3, 3, 2, 2))
P2 <- as.cl_partition(c(1, 1, 2, 2, 3, 3, 4, 4))
cl_dissimilarity(P1, P2, "BA/A")
cl_dissimilarity(P1, P2, "BA/C")
## Hierarchical clustering.
d <- dist(USArrests)
x <- hclust(d)
cl_dissimilarity(x, d, "cophenetic")
cl_dissimilarity(x, d, "gamma")