gMADD_DI {HDLSSkST}    R Documentation

Modified K-Means Algorithm by Using a New Dissimilarity Measure, MADD and DUNN Index

Description

Performs a modified K-means algorithm that uses a new dissimilarity measure, called MADD, together with the DUNN index. Returns the estimated cluster (class) labels, or memberships, of the observations and the corresponding DUNN index.

Usage

gMADD_DI(s_psi, s_h, kmax, lb, M)

Arguments

s_psi

code selecting the function psi required for clustering: 1 for t^2, 2 for 1-\exp(-t), 3 for 1-\exp(-t^2), 4 for \log(1+t), 5 for t

s_h

code selecting the function h required for clustering: 1 for \sqrt t, 2 for t

kmax

maximum number of clusters to consider when estimating the total number of clusters in the pooled observations

lb

block length: each observation is partitioned into smaller vectors of the same length lb

M

n \times d matrix of pooled observations; the rows (observations) should be grouped by their respective classes
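The roles of s_psi, s_h and lb can be illustrated with a small sketch. The two lookup tables below come directly from the argument descriptions; everything else is an assumption made for illustration. In particular, split_blocks and block_dissim are hypothetical helpers, not functions of HDLSSkST, and the way they combine psi, h and the blocks (h applied to the average of psi over the Euclidean norms of the block differences) is only one plausible reading, not necessarily what gMADD_DI computes internally.

```r
# Code-to-function tables, as listed in the argument descriptions.
psi_funs <- list(
  function(t) t^2,            # s_psi = 1
  function(t) 1 - exp(-t),    # s_psi = 2
  function(t) 1 - exp(-t^2),  # s_psi = 3
  function(t) log(1 + t),     # s_psi = 4
  function(t) t               # s_psi = 5
)
h_funs <- list(
  sqrt,                       # s_h = 1
  identity                    # s_h = 2
)

# Hypothetical helper: split a vector into consecutive blocks of length lb.
# ASSUMPTION: the dimension is a multiple of lb.
split_blocks <- function(x, lb) {
  stopifnot(length(x) %% lb == 0)
  split(x, rep(seq_len(length(x) / lb), each = lb))
}

# Hypothetical block-wise dissimilarity (an assumption, not the package's code):
# h( average over blocks of psi( Euclidean norm of the block difference ) ).
block_dissim <- function(x, y, s_psi, s_h, lb) {
  psi <- psi_funs[[s_psi]]
  h <- h_funs[[s_h]]
  blocks <- split_blocks(x - y, lb)
  block_norms <- vapply(blocks, function(b) sqrt(sum(b^2)), numeric(1))
  h(mean(psi(block_norms)))
}

x <- c(0, 0, 0, 0)
y <- c(3, 4, 0, 0)
block_dissim(x, y, s_psi = 1, s_h = 1, lb = 1)  # 2.5
block_dissim(x, y, s_psi = 1, s_h = 1, lb = 2)
```

Under this reading, the choice s_psi = 1, s_h = 1, lb = 1 reduces to the Euclidean distance divided by \sqrt d (here 5/2 = 2.5).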

Details

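The DUNN index is usually used for cluster validation, but here it is used to estimate the total number of clusters k by \hat k = \arg\max_{2 \le k' \le k^*} DI(k'), where DI(k') denotes the DUNN index of the partition into k' clusters and k^* = 2k. The upper limit k^* is supplied through the argument kmax.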
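The estimation rule above can be sketched in base R. This is only an illustration of choosing k by maximising the DUNN index: it uses ordinary Euclidean distances and average-linkage hierarchical clustering from the stats package, not the MADD dissimilarity or the modified K-means of gMADD_DI. The function dunn_index and the toy data are assumptions introduced for this example.

```r
# DUNN index of a partition: smallest distance between points in different
# clusters divided by the largest distance between points in the same cluster.
# dunn_index is a hypothetical helper, not part of HDLSSkST.
dunn_index <- function(D, labels) {
  D <- as.matrix(D)
  same <- outer(labels, labels, "==")   # TRUE where two points share a cluster
  off_diag <- row(D) != col(D)          # exclude the zero self-distances
  min_between <- min(D[!same])
  max_within <- max(D[same & off_diag], 0)
  min_between / max_within
}

# Toy data (made up for illustration): three well-separated groups in 2 dimensions.
set.seed(1)
X <- rbind(matrix(rnorm(20 * 2, mean = 0,  sd = 0.3), 20, 2),
           matrix(rnorm(20 * 2, mean = 5,  sd = 0.3), 20, 2),
           matrix(rnorm(20 * 2, mean = 10, sd = 0.3), 20, 2))

D <- dist(X)                            # Euclidean distances, NOT MADD
tree <- hclust(D, method = "average")   # stand-in for the modified K-means

k_star <- 6                             # plays the role of k^* (kmax)
di <- rep(NA_real_, k_star)             # entry 1 stays NA: DI needs k' >= 2
for (k in 2:k_star) {
  di[k] <- dunn_index(D, cutree(tree, k = k))
}

k_hat <- which.max(di)                  # which.max() skips the NA at k' = 1
k_hat
```

Leaving the first entry as NA mirrors the fact that the DUNN index is only defined for at least two clusters, so the maximisation effectively runs over 2 \le k' \le k^*.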

Value

A kmax \times (n+1) matrix: the first n columns hold the estimated cluster (class) labels of the observations, and the last column holds the corresponding DUNN index.

Note

The result of gMADD_DI is a matrix. The first row of this matrix carries no information about the estimated class labels or the DUNN index of the observations, since the DUNN index is only defined for k \ge 2. The last column of the matrix contains the DUNN indexes. The estimated cluster labels of the observations are read from the row with the maximum DUNN index.
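Reading the returned matrix as described can be sketched as follows. The matrix res below is a made-up stand-in with n = 6 observations and kmax = 4, not real gMADD_DI output; the zeros in its first row and the assumption that row k' corresponds to k' clusters are part of the illustration.

```r
# Hypothetical result matrix: columns 1..n are labels, column n + 1 is the DUNN index.
n <- 6
res <- rbind(
  c(0, 0, 0, 0, 0, 0, 0),    # row 1: not informative (k = 1)
  c(0, 0, 0, 1, 1, 1, 2.1),  # row 2: labels for 2 clusters, DUNN index 2.1
  c(0, 0, 1, 1, 2, 2, 0.8),  # row 3: labels for 3 clusters
  c(0, 1, 1, 2, 3, 3, 0.5)   # row 4: labels for 4 clusters
)

# Drop row 1 before maximising, then shift the index back by one.
di <- res[-1, n + 1]
k_hat <- which.max(di) + 1

# Labels belonging to the selected number of clusters.
labels_hat <- res[k_hat, seq_len(n)]
table(labels_hat)            # cluster sizes
```

Dropping the first row explicitly avoids relying on whatever value happens to be stored there.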

Author(s)

Biplab Paul, Shyamal K. De and Anil K. Ghosh

Maintainer: Biplab Paul <paul.biplab497@gmail.com>

References

Biplab Paul, Shyamal K De and Anil K Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data, Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.

Soham Sarkar and Anil K Ghosh (2019). On perfect clustering of high dimension, low sample size data, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi:10.1109/TPAMI.2019.2912599.

Joseph C Dunn (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, Journal of Cybernetics, doi:10.1080/01969727308546046.

Examples

  # Modified K-means algorithm:
  # multivariate normal distribution
  # generate data with dimension d = 500
  set.seed(151)
  n1=n2=n3=n4=10
  d = 500
  I1 <- matrix(rnorm(n1*d,mean=0,sd=1),n1,d)
  I2 <- matrix(rnorm(n2*d,mean=0.5,sd=1),n2,d) 
  I3 <- matrix(rnorm(n3*d,mean=1,sd=1),n3,d) 
  I4 <- matrix(rnorm(n4*d,mean=1.5,sd=1),n4,d) 
  n_cl <- 4
  N <- n1+n2+n3+n4
  X <- as.matrix(rbind(I1,I2,I3,I4)) 
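  # s_psi = 1 (t^2), s_h = 1 (sqrt t), kmax = 2*n_cl, lb = 1
  dvec_di_mat <- gMADD_DI(1,1,2*n_cl,1,X)
  # row with the largest DUNN index (last column) gives the estimated number of clusters
  est_no_cl <- which.max(dvec_di_mat[ ,(N+1)])
  # estimated cluster labels of the N observations
  dvec_di_mat[est_no_cl,1:N]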

   ## outputs:
   #[1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
   

[Package HDLSSkST version 2.1.0 Index]