R: Wiroonsri(2024) correlation-based cluster validity indices

Wvalid {UniversalCVI}

R Documentation

Wiroonsri(2024) correlation-based cluster validity indices

Description

Computes the NC correlation and the NCI, NCI1 and NCI2 cluster validity indices of Wiroonsri (2024) for each number of clusters from a user-specified kmin to kmax, where the clusters are obtained by either K-means or hierarchical clustering.

Usage

Wvalid(x, kmax, kmin = 2, method = "kmeans",
  corr = "pearson", nstart = 100, sampling = 1, NCstart = TRUE)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	the maximum number of clusters to be considered.
`kmin`	the minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to use (`"kmeans"`, `"hclust_complete"`, `"hclust_average"` or `"hclust_single"`). The default is `"kmeans"`.
`corr`	a character string indicating which correlation coefficient is to be computed (`"pearson"`, `"kendall"` or `"spearman"`). The default is `"pearson"`.
`nstart`	the maximum number of initial random sets for K-means, used when `method = "kmeans"`. The default is `100`.
`sampling`	a number greater than 0 and less than or equal to 1 indicating the undersampling proportion of data to be used. This argument is intended for handling a large dataset. The default is `1`.
`NCstart`	a logical value. If `TRUE`, the NC correlation at `k=1` is defined as the ratio introduced in the reference. Otherwise, it is assigned as `0`.

Details

The NC correlation is the correlation between the actual distance between a pair of data points and the distance between the centroids of the clusters in which the two points are located. NCI1 and NCI2 are the quotient and the difference, respectively, of the same two ratios. The first ratio is the NC improvement from k-1 clusters to k clusters over the entire room for improvement. The second ratio is the NC improvement from k clusters to k+1 clusters over the entire room for improvement. NCI is a combination of NCI1 and NCI2.
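As a rough illustration of the NC correlation described above, the sketch below pairs every point-to-point distance with the distance between the centroids of the two points' clusters and correlates the two vectors. The helper `nc_correlation` is hypothetical and is not part of UniversalCVI; it assumes Euclidean distances and makes no claim to reproduce the package's internals or its handling of `k=1`.

```r
# Hypothetical helper for illustration only; not a UniversalCVI function.
# Assumes Euclidean distances between points and between centroids.
nc_correlation <- function(x, cluster, method = "pearson") {
  x <- as.matrix(x)
  labels <- sort(unique(cluster))

  # One centroid per cluster label, stacked as rows of a matrix.
  centers <- do.call(rbind, lapply(labels, function(g) {
    colMeans(x[cluster == g, , drop = FALSE])
  }))

  # Replace each point by the centroid of its own cluster, so that
  # dist() on this matrix lists centroid distances in the same pair
  # order as dist() on the raw data.
  own_center <- centers[match(cluster, labels), , drop = FALSE]

  d_points  <- as.vector(dist(x))
  d_centers <- as.vector(dist(own_center))

  # With a single cluster every centroid distance is 0, so the
  # correlation is undefined (cor() returns NA with a warning).
  cor(d_points, d_centers, method = method)
}

# Toy usage on simulated data with two well-separated groups.
set.seed(1)
toy <- rbind(matrix(rnorm(40), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 10)
nc_correlation(toy, km$cluster)
```

Because the correlation is undefined when all points share one cluster, a value at `k=1` has to be supplied separately, which is what the `NCstart` argument controls.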

Value

`NC`	the NC correlations for `k` from `kmin-1` to `kmax+1` shown in a data frame where the first and the second columns are `k` and the NC, respectively.

Each of the following is a data frame giving the values of the index for k from kmin to kmax.

`NCI`	the NCI index.
`NCI1`	the NCI1 index.
`NCI2`	the NCI2 index.
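The `NC` data frame covers `kmin-1` to `kmax+1`, one step wider on each side than the index data frames. A plausible reading is that the two ratios described in Details need the NC correlation at both `k-1` and `k+1` for every `k`. The sketch below computes those two ratios from a vector of consecutive NC values. The helper `improvement_ratios` is hypothetical, and taking "room for improvement" to mean `1 - NC` is an assumption drawn from the wording in Details, not from the package source; how the package combines the ratios into NCI1, NCI2 and NCI, including edge cases, is not reproduced here.

```r
# Hypothetical helper for illustration only; not a UniversalCVI function.
# Assumption: "room for improvement" at a given k is 1 - NC(k).
improvement_ratios <- function(nc) {
  n <- length(nc)
  stopifnot(n >= 3)

  # Positions that have both a predecessor and a successor.
  inner <- 2:(n - 1)
  prev <- nc[inner - 1]
  cur  <- nc[inner]
  nxt  <- nc[inner + 1]

  data.frame(
    position  = inner,
    from_prev = (cur - prev) / (1 - prev),  # improvement from k-1 to k
    to_next   = (nxt - cur) / (1 - cur)     # improvement from k to k+1
  )
}

# Made-up NC values, used only to show the call.
improvement_ratios(c(0.20, 0.60, 0.80, 0.85))
```

A number of clusters looks attractive under this reading when `from_prev` is large and `to_next` is small, that is, when moving to k clusters helped a lot and adding one more cluster helps little.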

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, "Clustering performance analysis using a new correlation based cluster validity index," Pattern Recognition, 145, 109910, 2024. doi:10.1016/j.patcog.2023.109910

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute all the indices by Wvalid
K.NC = Wvalid(scale(x), kmax = 15, kmin=2, method = 'kmeans',
  corr='pearson', nstart=100, NCstart = TRUE)
print(K.NC)

# The optimal number of clusters (largest NCI)
K.NC$NCI[which.max(K.NC$NCI$NCI),]

# ---- Hierarchical ----

# Average linkage

# Compute all the indices by Wvalid
H.NC = Wvalid(scale(x), kmax = 15, kmin=2, method = 'hclust_average',
  corr='pearson', nstart=100, NCstart = TRUE)
print(H.NC)

# The optimal number of clusters (largest NCI)
H.NC$NCI[which.max(H.NC$NCI$NCI),]

[Package UniversalCVI version 1.1.2 Index]