validation_kproto {clustMixType}  R Documentation 
Calculates the preferred validation index for a k-prototypes clustering with k clusters, or computes the optimal number of clusters based on the chosen index for k-prototypes clustering. Possible validation indices are: cindex, dunn, gamma, gplus, mcclain, ptbiserial, silhouette and tau.
validation_kproto(
method = NULL,
object = NULL,
data = NULL,
k = NULL,
lambda = NULL,
kp_obj = "optimal",
...
)
method 
Character specifying the validation index: cindex, dunn, gamma, gplus, mcclain, ptbiserial, silhouette or tau.
object 
Object of class 
data 
Original data; only required if 
k 
Vector specifying the search range for optimum number of clusters; if 
lambda 
Factor to trade off between Euclidean distance of numeric variables and simple matching coefficient between categorical variables. 
kp_obj 
Character, either "optimal" or "all": output of the index-optimal clustering (kp_obj == "optimal") or of all computed cluster partitions (kp_obj == "all"); only required if 
... 
Further arguments passed to

More information about the implemented validation indices:
cindex
Cindex = \frac{S_w - S_{min}}{S_{max} - S_{min}}
For S_{min} and S_{max} it is necessary to calculate the distances between all pairs of points in the entire data set (\frac{n(n-1)}{2} pairs).
S_{min} is the sum of the N_w smallest distances and S_{max} is the sum of the N_w largest distances, where N_w is the total number of pairs of objects belonging to the same cluster. S_w is the sum of the within-cluster distances.
The minimum value of the index is used to indicate the optimal number of clusters.
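As a rough illustration of the definition above, the following base R sketch computes the cindex by hand. The toy vector x, the partition cl and the use of Euclidean distances via dist() are assumptions made for this example only; the package itself works with the mixed-type k-prototypes distance, and this is not its implementation.

```r
# hypothetical toy data: four points on a line and a two-cluster partition
x  <- c(0, 1, 10, 11)
cl <- c(1, 1, 2, 2)

dm    <- as.matrix(dist(x))          # Euclidean stand-in for the mixed-type distance
low   <- lower.tri(dm)               # selects each of the n(n-1)/2 pairs once
dists <- dm[low]
same  <- outer(cl, cl, "==")[low]    # TRUE for pairs within the same cluster

n_w    <- sum(same)                        # number of within-cluster pairs
s_w    <- sum(dists[same])                 # sum of within-cluster distances
sorted <- sort(dists)
s_min  <- sum(sorted[seq_len(n_w)])        # sum of the n_w smallest distances
s_max  <- sum(rev(sorted)[seq_len(n_w)])   # sum of the n_w largest distances

cindex <- (s_w - s_min) / (s_max - s_min)
cindex
```

For this well-separated partition the within-cluster pairs are exactly the smallest distances, so S_w equals S_{min} and the index is 0.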
dunn
Dunn = \frac{\min_{1 \leq i < j \leq q} d(C_i, C_j)}{\max_{1 \leq k \leq q} diam(C_k)}
The dissimilarity between two clusters C_i and C_j is defined as d(C_i, C_j)=\min_{x \in C_i, y \in C_j} d(x,y), and the diameter of a cluster C_k is defined as diam(C_k)=\max_{x,y \in C_k} d(x,y).
The maximum value of the index is used to indicate the optimal number of clusters.
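A minimal sketch of the Dunn index in base R, assuming Euclidean distances on a hypothetical toy vector x with partition cl (both invented for illustration, not taken from the package). It relies on the fact that the minimum over all cluster pairs of d(C_i, C_j) is the smallest between-cluster distance, and the maximum diameter is the largest within-cluster distance.

```r
# hypothetical toy data: four points on a line and a two-cluster partition
x  <- c(0, 1, 10, 11)
cl <- c(1, 1, 2, 2)

dm    <- as.matrix(dist(x))          # Euclidean stand-in for the mixed-type distance
low   <- lower.tri(dm)
dists <- dm[low]
same  <- outer(cl, cl, "==")[low]    # TRUE for pairs within the same cluster

min_between <- min(dists[!same])     # smallest distance between objects of different clusters
max_diam    <- max(dists[same])      # largest within-cluster distance, i.e. largest diameter

dunn <- min_between / max_diam
dunn
```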
gamma
Gamma = \frac{s(+) - s(-)}{s(+) + s(-)}
Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities.
s(+) is the number of concordant comparisons and s(-) is the number of discordant comparisons.
A comparison is named concordant (resp. discordant) if a within-cluster dissimilarity is strictly less (resp. strictly greater) than a between-cluster dissimilarity.
The maximum value of the index is used to indicate the optimal number of clusters.
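The counting of concordant and discordant comparisons can be sketched in base R as follows. The toy data, the Euclidean distance and the variable names (for instance gamma_index, chosen to avoid masking base R's gamma function) are assumptions for illustration only.

```r
# hypothetical toy data: four points on a line and a two-cluster partition
x  <- c(0, 1, 10, 11)
cl <- c(1, 1, 2, 2)

dm    <- as.matrix(dist(x))          # Euclidean stand-in for the mixed-type distance
low   <- lower.tri(dm)
dists <- dm[low]
same  <- outer(cl, cl, "==")[low]

within  <- dists[same]               # within-cluster dissimilarities
between <- dists[!same]              # between-cluster dissimilarities

# compare every within-cluster dissimilarity with every between-cluster one
s_plus  <- sum(outer(within, between, "<"))   # concordant: within strictly less
s_minus <- sum(outer(within, between, ">"))   # discordant: within strictly greater

gamma_index <- (s_plus - s_minus) / (s_plus + s_minus)
gamma_index
```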
gplus
Gplus = \frac{2 \cdot s(-)}{\frac{n(n-1)}{2} \cdot (\frac{n(n-1)}{2} - 1)}
Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities.
s(-) is the number of discordant comparisons; a comparison is named discordant if a within-cluster dissimilarity is strictly greater than a between-cluster dissimilarity.
The minimum value of the index is used to indicate the optimal number of clusters.
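To show why smaller values are preferred, the sketch below evaluates Gplus for a well-separated and a poorly chosen partition of the same toy data. The helper gplus(), the data and the Euclidean distance are hypothetical choices for this example and not part of the package.

```r
# hypothetical helper: Gplus for a numeric vector x and a cluster vector cl
gplus <- function(x, cl) {
  dm    <- as.matrix(dist(x))        # Euclidean stand-in for the mixed-type distance
  low   <- lower.tri(dm)
  dists <- dm[low]
  same  <- outer(cl, cl, "==")[low]
  # discordant: a within-cluster dissimilarity strictly greater than a between-cluster one
  s_minus <- sum(outer(dists[same], dists[!same], ">"))
  n_t <- length(dists)               # n(n-1)/2 pairwise distances
  2 * s_minus / (n_t * (n_t - 1))
}

x <- c(0, 1, 10, 11)
gplus_good <- gplus(x, c(1, 1, 2, 2))   # neighbours share a cluster
gplus_poor <- gplus(x, c(1, 2, 1, 2))   # neighbours are split up
c(good = gplus_good, poor = gplus_poor)
```

The well-separated partition has no discordant comparisons and yields 0, whereas the poor partition has six and yields 0.4.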
mcclain
McClain = \frac{\bar{S}_w}{\bar{S}_b}
\bar{S}_w is the sum of within-cluster distances divided by the number of within-cluster distances and \bar{S}_b is the sum of between-cluster distances divided by the number of between-cluster distances.
The minimum value of the index is used to indicate the optimal number of clusters.
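Since both terms are simple averages, the McClain index reduces to a ratio of two means. A minimal base R sketch, assuming a hypothetical toy vector x, partition cl and Euclidean distances:

```r
# hypothetical toy data: four points on a line and a two-cluster partition
x  <- c(0, 1, 10, 11)
cl <- c(1, 1, 2, 2)

dm    <- as.matrix(dist(x))          # Euclidean stand-in for the mixed-type distance
low   <- lower.tri(dm)
dists <- dm[low]
same  <- outer(cl, cl, "==")[low]

s_w_bar <- mean(dists[same])         # average within-cluster distance
s_b_bar <- mean(dists[!same])        # average between-cluster distance

mcclain <- s_w_bar / s_b_bar
mcclain
```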
ptbiserial
Ptbiserial = \frac{(\bar{S}_b - \bar{S}_w) \cdot (\frac{N_w \cdot N_b}{N_t^2})^{0.5}}{s_d}
\bar{S}_w is the sum of within-cluster distances divided by the number of within-cluster distances and \bar{S}_b is the sum of between-cluster distances divided by the number of between-cluster distances.
N_t is the total number of pairs of objects in the data, N_w is the total number of pairs of objects belonging to the same cluster and N_b is the total number of pairs of objects belonging to different clusters.
s_d is the standard deviation of all distances.
The maximum value of the index is used to indicate the optimal number of clusters.
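The following base R sketch evaluates the formula above term by term. The toy data and Euclidean distances are hypothetical, and using sd() (the sample standard deviation with denominator N_t - 1) for s_d is an assumption of this example; the text only says "standard deviation of all distances".

```r
# hypothetical toy data: four points on a line and a two-cluster partition
x  <- c(0, 1, 10, 11)
cl <- c(1, 1, 2, 2)

dm    <- as.matrix(dist(x))          # Euclidean stand-in for the mixed-type distance
low   <- lower.tri(dm)
dists <- dm[low]
same  <- outer(cl, cl, "==")[low]

n_t <- length(dists)                 # all pairs of objects
n_w <- sum(same)                     # pairs within the same cluster
n_b <- sum(!same)                    # pairs in different clusters

s_w_bar <- mean(dists[same])
s_b_bar <- mean(dists[!same])
s_d     <- sd(dists)                 # assumption: sample standard deviation

ptbiserial <- (s_b_bar - s_w_bar) * sqrt(n_w * n_b / n_t^2) / s_d
ptbiserial
```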
silhouette
Silhouette = \frac{1}{n} \sum_{i=1}^n \frac{b(i) - a(i)}{max(a(i),b(i))}
a(i) is the average dissimilarity of the i-th object to all other objects of its own cluster.
b(i)=min(d(i,C)), where d(i,C) is the average dissimilarity of the i-th object to the objects of a cluster C other than its own.
The maximum value of the index is used to indicate the optimal number of clusters.
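A sketch of the per-object computation in base R. It assumes hypothetical toy data, Euclidean distances, and that every cluster contains at least two objects (otherwise a(i) is undefined in this simple version); it is not the package's implementation.

```r
# hypothetical toy data: four points on a line and a two-cluster partition
x  <- c(0, 1, 10, 11)
cl <- c(1, 1, 2, 2)
dm <- as.matrix(dist(x))             # Euclidean stand-in for the mixed-type distance

sil_width <- sapply(seq_along(cl), function(i) {
  own    <- cl == cl[i]
  own[i] <- FALSE                    # exclude the object itself
  a_i    <- mean(dm[i, own])         # average dissimilarity to its own cluster
  others <- setdiff(unique(cl), cl[i])
  # average dissimilarity to each other cluster; keep the closest one
  b_i    <- min(sapply(others, function(cc) mean(dm[i, cl == cc])))
  (b_i - a_i) / max(a_i, b_i)
})

silhouette_index <- mean(sil_width)
silhouette_index
```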
tau
Tau = \frac{s(+) - s(-)}{((\frac{N_t(N_t-1)}{2} - t)\frac{N_t(N_t-1)}{2})^{0.5}}
Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities.
s(+) is the number of concordant comparisons and s(-) is the number of discordant comparisons.
A comparison is named concordant (resp. discordant) if a within-cluster dissimilarity is strictly less (resp. strictly greater) than a between-cluster dissimilarity.
N_t is the total number of distances \frac{n(n-1)}{2} and t is the number of comparisons of two pairs of objects where both pairs represent within-cluster comparisons or both pairs are between-cluster comparisons.
The maximum value of the index is used to indicate the optimal number of clusters.
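The sketch below puts the pieces of the Tau formula together in base R for hypothetical toy data with Euclidean distances. Computing t as choose(N_w, 2) + choose(N_b, 2) is this example's reading of the definition above (comparisons where both pairs are within-cluster, or both are between-cluster) and should be treated as an assumption.

```r
# hypothetical toy data: four points on a line and a two-cluster partition
x  <- c(0, 1, 10, 11)
cl <- c(1, 1, 2, 2)

dm    <- as.matrix(dist(x))          # Euclidean stand-in for the mixed-type distance
low   <- lower.tri(dm)
dists <- dm[low]
same  <- outer(cl, cl, "==")[low]

within  <- dists[same]
between <- dists[!same]
s_plus  <- sum(outer(within, between, "<"))   # concordant comparisons
s_minus <- sum(outer(within, between, ">"))   # discordant comparisons

n_t     <- length(dists)                      # total number of distances n(n-1)/2
n_comp  <- n_t * (n_t - 1) / 2                # all comparisons of two distances
t_ties  <- choose(length(within), 2) + choose(length(between), 2)

tau <- (s_plus - s_minus) / sqrt((n_comp - t_ties) * n_comp)
tau
```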
For computing the optimal number of clusters based on the chosen validation index for k-prototypes clustering, the output contains:
k_opt 
optimal number of clusters (sampled in case of ambiguity) 
index_opt 
index value of the index-optimal clustering 
indices 
calculated indices for 
kp_obj 
if(kp_obj == "optimal") the kproto object of the index-optimal clustering, and if(kp_obj == "all") all kproto objects that were calculated 
For computing the index value for a given k-prototypes clustering, the output contains:
index 
calculated index value 
Rabea Aschenbruck
Aschenbruck, R., Szepannek, G. (2020): Cluster Validation for Mixed-Type Data. Archives of Data Science, Series A, Vol 6, Issue 1. doi: 10.5445/KSP/1000098011/02.
Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A. (2014): NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, Vol 61, Issue 6. doi: 10.18637/jss.v061.i06.
# load the package that provides kproto() and validation_kproto()
library(clustMixType)
# generate toy data with factors and numerics
n <- 10
prb <- 0.99
muk <- 2.5
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1 - prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1 - prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1 - prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1 - prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1,x2,x3,x4)
# calculate optimal number of clusters, index values and cluster partition with the Silhouette index
val <- validation_kproto(method = "silhouette", data = x, k = 3:5, nstart = 5)
# apply k-prototypes
kpres <- kproto(x, 4, keep.data = TRUE)
# calculate the cindex value for the given cluster partition
cindex_value <- validation_kproto(method = "cindex", object = kpres)