Hvalid {UniversalCVI} | R Documentation |
Wiroonsri (2024) correlation-based cluster validity indices and other well-known cluster validity indices
Description
Computes the cluster validity indexes used in Wiroonsri (2024) for the result of either kmeans or hierarchical clustering, for each number of clusters from a user-specified kmin to kmax. It includes the DI (J. C. Dunn, 1973) index, CH (T. Calinski and J. Harabasz, 1974) index, DB (D. L. Davies and D. W. Bouldin, 1979) index, PB (G. W. Milligan, 1980) index, CSL (C. H. Chou et al., 2004) index, PBM (M. K. Pakhira et al., 2004) index, DBs (M. Kim and R. S. Ramakrishna, 2005) index, Score function (S. Saitta et al., 2007), STR (A. Starczewski, 2017) index, and the NC, NCI, NCI1, and NCI2 (N. Wiroonsri, 2024) indexes.
Usage
Hvalid(x, kmax, kmin = 2, indexlist = "all", method = "kmeans",
p = 2, q = 2, corr = "pearson", nstart = 100, sampling = 1, NCstart = TRUE)
Arguments
x |
a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point. |
kmax |
a maximum number of clusters to be considered. |
kmin |
a minimum number of clusters to be considered. The default is 2. |
indexlist |
a character string indicating which cluster validity indexes to be computed ( |
method |
a character string indicating which clustering method to be used ( |
p |
the power of the Minkowski distance between centroids of clusters for |
q |
the power of dispersion measure of a cluster for |
corr |
a character string indicating which correlation coefficient is to be computed ( |
nstart |
a maximum number of initial random sets for kmeans for |
sampling |
a number greater than 0 and less than or equal to 1 indicating the undersampling proportion of data to be used. This argument is intended for handling a large dataset. The default is 1. |
NCstart |
logical for |
Details
Hvalid computes the well-known cluster validity indices used in Wiroonsri (2024): the DI (J. C. Dunn, 1973) index, CH (T. Calinski and J. Harabasz, 1974) index, DB (D. L. Davies and D. W. Bouldin, 1979) index, PB (G. W. Milligan, 1980) index, CSL (C. H. Chou et al., 2004) index, PBM (M. K. Pakhira et al., 2004) index, DBs (M. Kim and R. S. Ramakrishna, 2005) index, Score function (S. Saitta et al., 2007), STR (A. Starczewski, 2017) index, and the NC, NCI, NCI1, and NCI2 (N. Wiroonsri, 2024) indexes.
The NC correlation is the correlation between the actual distance between a pair of data points and the distance between the centroids of the clusters in which the two points are located. NCI1 and NCI2 are the proportion and the subtraction, respectively, of the same two ratios. The first ratio is the NC improvement from k-1 clusters to k clusters over the entire room for improvement. The second ratio is the NC improvement from k clusters to k+1 clusters over the entire room for improvement. NCI is a combination of NCI1 and NCI2.
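The NC correlation and the two ratios described above can be illustrated with a minimal base R sketch. This is not the UniversalCVI implementation: the toy data, the helper names nc_corr and ratio, and the reading of "room for improvement" as 1 minus the current NC value are all assumptions made for illustration.

```r
# Sketch only: base R, not the UniversalCVI implementation.
set.seed(1)

# Toy data (hypothetical): three well-separated groups in two dimensions
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2))

# NC correlation for a kmeans result with k clusters (hypothetical helper)
nc_corr <- function(x, k, corr = "pearson") {
  km <- kmeans(x, centers = k, nstart = 20)
  # Actual distance between every pair of data points
  d_points <- dist(x)
  # Replace each point by its cluster centroid, so that each pairwise distance
  # becomes the distance between the centroids of the two points' clusters
  d_centroids <- dist(km$centers[km$cluster, , drop = FALSE])
  cor(as.vector(d_points), as.vector(d_centroids), method = corr)
}

# NC values for k = 2, ..., 5
ks <- 2:5
nc <- sapply(ks, function(k) nc_corr(x, k))
names(nc) <- ks

# Assumption: "room for improvement" is read as 1 minus the current NC value
ratio <- function(from, to) (to - from) / (1 - from)

# The two ratios around k = 3
r1 <- ratio(nc[["2"]], nc[["3"]])  # improvement from k-1 to k clusters
r2 <- ratio(nc[["3"]], nc[["4"]])  # improvement from k to k+1 clusters

# Proportion and subtraction of the same two ratios
c(proportion = r1 / r2, subtraction = r1 - r2)
```

Replacing each point by its cluster centroid before calling dist() keeps the two distance vectors in the same pair order, so they can be passed directly to cor().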
Value
NC |
the NC correlations for |
Each of the following shows the values of the index for k from kmin to kmax in a data frame.
NCI |
the NCI index. |
NCI1 |
the NCI1 index. |
NCI2 |
the NCI2 index. |
PB |
the PB index. |
DI |
the DI index. |
DB |
the DB index. |
DBs |
the DBs index. |
CSL |
the CSL index. |
CH |
the CH index. |
SF |
the Score function. |
STR |
the STR index. |
PBM |
the PBM index. |
Author(s)
Nathakhun Wiroonsri and Onthada Preedasawakul
References
J. C. Bezdek, N. R. Pal, "Some new indexes of cluster validity," IEEE Transactions on Systems, Man, and Cybernetics, Part B, 28, 301-315 (1998).
T. Calinski, J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics, 3, 1-27 (1974).
C. H. Chou, M. C. Su, E. Lai, "A new cluster validity measure and its application to image compression," Pattern Anal Applic, 7, 205-220 (2004).
D. L. Davies, D. W. Bouldin, "A cluster separation measure," IEEE Trans Pattern Anal Machine Intell, 1, 224-227 (1979).
J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," J Cybern, 3(3), 32-57 (1973).
M. Kim, R. S. Ramakrishna, "New indices for cluster validity assessment," Pattern Recognition Letters, 26, 2353-2363 (2005).
G. W. Milligan, "An examination of the effect of six types of error perturbation on fifteen clustering algorithms," Psychometrika, 45, 325-342 (1980).
M. K. Pakhira, S. Bandyopadhyay, U. Maulik, "Validity index for crisp and fuzzy clusters," Pattern Recogn, 37(3), 487-501 (2004).
S. Saitta, B. Raphael, I. Smith, "A bounded index for cluster validity," In Perner, P.: Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science, 4571, Springer (2007).
A. Starczewski, "A new validity index for crisp clusters," Pattern Anal Applic, 20, 687-700 (2017).
N. Wiroonsri, "Clustering performance analysis using a new correlation based cluster validity index," Pattern Recognition, 145, 109910 (2024).
See Also
Wvalid, FzzyCVIs, DI.IDX, R1_data
Examples
library(UniversalCVI)
# The data is from Wiroonsri (2024).
x = R1_data[,1:2]
# ---- Kmeans ----
# Compute all the indices by Hvalid
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = "all",
       method = "kmeans", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)
# Compute a selected set of indices ("NC","NCI","DI","DB")
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = c("NC","NCI","DI","DB"),
       method = "kmeans", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)
# ---- Hierarchical ----
# Average linkage
# Compute all the indices by Hvalid
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = "all",
       method = "hclust_average", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)
# Compute a selected set of indices ("NC","NCI","DI","DB")
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = c("NC","NCI","DI","DB"),
       method = "hclust_average", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)
# ---- Plot and compare the indexes ----
# Compute six cluster validity indexes of an average-linkage hierarchical clustering result for k from 2 to 15
IDX.list = c("NCI", "DI", "DB", "DBs", "CSL", "CH")
Hvalid.result = Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = IDX.list,
                       method = "hclust_average", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)
# Plot the computed indexes
plot_idx(Hvalid.result)