Wvalid {UniversalCVI}    R Documentation

Wiroonsri (2024) correlation-based cluster validity indices

Description

Computes the NC correlation and the NCI, NCI1 and NCI2 cluster validity indices for each number of clusters from a user-specified kmin to kmax. The clusterings are obtained from either K-means or hierarchical clustering, and the indices follow the recent paper by Wiroonsri (2024).

Usage

Wvalid(x, kmax, kmin = 2, method = "kmeans",
  corr = "pearson", nstart = 100, sampling = 1, NCstart = TRUE)

Arguments

x

a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.

kmax

the maximum number of clusters to be considered.

kmin

the minimum number of clusters to be considered. The default is 2.

method

a character string indicating which clustering method is to be used ("kmeans", "hclust_complete", "hclust_average" or "hclust_single"). The default is "kmeans".

corr

a character string indicating which correlation coefficient is to be computed ("pearson", "kendall" or "spearman"). The default is "pearson".

nstart

the maximum number of initial random sets for K-means, used only when method = "kmeans". The default is 100.

sampling

a number greater than 0 and less than or equal to 1 indicating the undersampling proportion of data to be used. This argument is intended for handling a large dataset. The default is 1.

NCstart

logical; affects the "NC", "NCI", "NCI1" and "NCI2" values. If TRUE, the NC correlation at k = 1 is defined as the ratio introduced in the reference. Otherwise, it is set to 0.
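As an illustration of what an undersampling proportion such as sampling means, the following is a minimal base R sketch. It assumes simple random sampling of rows without replacement; how Wvalid draws its subsample internally is not stated on this page, and the names x, n_keep, idx and x_sub are hypothetical.

```r
# Illustrative only: keep a random fraction of the rows of a data matrix.
# This is NOT the package's internal code; it assumes plain random
# sampling without replacement.
set.seed(42)
x <- matrix(rnorm(2000), ncol = 2)     # 1000 synthetic points, 2 variables

sampling <- 0.25                       # proportion in (0, 1]
n_keep <- ceiling(sampling * nrow(x))  # number of rows to retain
idx <- sample(nrow(x), n_keep)         # random row indices, no replacement
x_sub <- x[idx, , drop = FALSE]        # undersampled data, still a matrix

dim(x_sub)
```

With sampling = 1 the same code keeps every row, which corresponds to the default of using the full data.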

Details

The NC correlation is the correlation between the actual distance between a pair of data points and the distance between the centroids of the clusters in which the two points lie. NCI1 and NCI2 are, respectively, the quotient and the difference of the same two ratios. The first ratio is the NC improvement from k-1 clusters to k clusters over the entire room for improvement. The second ratio is the NC improvement from k clusters to k+1 clusters over the entire room for improvement. NCI is a combination of NCI1 and NCI2.
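The definition of the NC correlation can be sketched in base R for a single value of k. This is a minimal illustration, not the package's implementation: it assumes Euclidean distances, K-means from the stats package, and synthetic data, and the names x, km, d_actual, d_centroid and nc are hypothetical.

```r
# Illustrative sketch of the NC correlation for one clustering (k = 2).
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)

# Actual pairwise distances between data points.
d_actual <- dist(x)

# Replace every point by the centroid of its cluster, then take pairwise
# distances: pairs in the same cluster get 0, other pairs get the
# distance between their two centroids.
d_centroid <- dist(km$centers[km$cluster, ])

# Both dist objects list the pairs in the same order, so the two vectors
# can be correlated directly.
nc <- cor(as.vector(d_actual), as.vector(d_centroid), method = "pearson")
nc
```

Repeating this for each k in a range gives one NC value per k, which is the kind of sequence the NCI1 and NCI2 ratios described above are built from.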

Value

NC

the NC correlations for k from kmin-1 to kmax+1, in a data frame whose first and second columns are k and the NC correlation, respectively.

Each of the following is a data frame giving the values of the corresponding index for k from kmin to kmax.

NCI

the NCI index.

NCI1

the NCI1 index.

NCI2

the NCI2 index.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, "Clustering performance analysis using a new correlation based cluster validity index," Pattern Recognition, 145, 109910, 2024. doi:10.1016/j.patcog.2023.109910

See Also

Hvalid, FzzyCVIs, DB.IDX, R1_data

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute all the indices by Wvalid
K.NC = Wvalid(scale(x), kmax = 15, kmin=2, method = 'kmeans',
  corr='pearson', nstart=100, NCstart = TRUE)
print(K.NC)

# The optimal number of clusters
K.NC$NCI[which.max(K.NC$NCI$NCI),]

# ---- Hierarchical ----

# Average linkage

# Compute all the indices by Wvalid
H.NC = Wvalid(scale(x), kmax = 15, kmin=2, method = 'hclust_average',
  corr='pearson', nstart=100, NCstart = TRUE)
print(H.NC)

# The optimal number of clusters
H.NC$NCI[which.max(H.NC$NCI$NCI),]

[Package UniversalCVI version 1.1.2 Index]