distrsimilarity {fpc} | R Documentation |
Similarity of within-cluster distributions to normal and uniform
Description
Two measures of dissimilarity between the within-cluster distributions of a dataset and normal or uniform distribution. For the normal it's the Kolmogorov dissimilarity between the Mahalanobis distances to the center and a chi-squared distribution. For the uniform it is the Kolmogorov distance between the distance to the kth nearest neighbour and a Gamma distribution (this is based on Byers and Raftery (1998)). The clusterwise values are aggregated by weighting with the cluster sizes.
Usage
distrsimilarity(x,clustering,noisecluster = FALSE,
distribution=c("normal","uniform"),nnk=2,
largeisgood=FALSE,messages=FALSE)
Arguments
x |
the data matrix; a numerical object which can be coerced to a matrix. |
clustering |
integer vector of class numbers; length must equal
|
noisecluster |
logical. If |
distribution |
vector of |
nnk |
integer. Number of nearest neighbors to use for dissimilarity to the uniform. |
largeisgood |
logical. If |
messages |
logical. If |
Value
List with the following components
kdnorm |
Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea). |
kdunif |
Kolmogorov distance between distribution of distances to
|
kdnormc |
vector of cluster-wise Kolmogorov distances between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution. |
kdunifc |
vector of cluster-wise Kolmogorov distances between
distribution of distances to |
xmahal |
vector of Mahalanobs distances to the respective cluster center. |
xdknn |
vector of distance to |
Note
It is very hard to capture similarity to a multivariate normal or uniform in a single value, and both used here have their shortcomings. Particularly, the dissimilarity to the uniform can still indicate a good fit if there are holes or it's a uniform distribution concentrated on several not connected sets.
Author(s)
Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/
References
Byers, S. and Raftery, A. E. (1998) Nearest-Neighbor Clutter Removal for Estimating Features in Spatial Point Processes, Journal of the American Statistical Association, 93, 577-584.
Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282
See Also
cqcluster.stats
,cluster.stats
for more cluster validity statistics.
Examples
set.seed(20000)
options(digits=3)
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
km3 <- kmeans(face,3)
distrsimilarity(face,km3$cluster)