kproto {clustMixType}  R Documentation 
Computes k-prototypes clustering for mixed-type data.
kproto(x, ...)

## Default S3 method:
kproto(
  x,
  k,
  lambda = NULL,
  iter.max = 100,
  nstart = 1,
  na.rm = TRUE,
  keep.data = TRUE,
  verbose = TRUE,
  ...
)
x 
Data frame with both numerics and factors. 
... 
Currently not used. 
k 
Either the number of clusters, a vector specifying indices of initial prototypes, or a data frame of prototypes with the same columns as x. 
lambda 
Parameter > 0 to trade off between the Euclidean distance of numeric variables and the simple matching coefficient between categorical variables. A vector of variable-specific factors is also possible; its order must correspond to the order of the variables in the data. In this case, each variable's distance is multiplied by its corresponding lambda value. 
iter.max 
Maximum number of iterations if the algorithm has not converged before. 
nstart 
If > 1, the computation is repeated with random initializations and the result with the minimum total within-cluster distance (tot.withinss) is returned. 
na.rm 
A logical value indicating whether NA values should be stripped before the computation proceeds. 
keep.data 
Logical; whether the original data should be included in the returned object. 
verbose 
Logical; whether information about the clustering procedure should be printed. Caution: If 
Like k-means, the algorithm iteratively recomputes cluster prototypes and reassigns observations to clusters.
Clusters are assigned using d(x, y) = d_{euclid}(x, y) + λ d_{simple\,matching}(x, y).
Cluster prototypes are computed as cluster means for numeric variables and as modes for factors
(cf. Huang, 1998).
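The assignment and prototype update described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: `mixed_dist` and `stat_mode` are hypothetical helpers, the toy data are made up, and the numeric part is assumed to be the squared Euclidean distance as in Huang (1998).

```r
# Toy data with two numeric variables and two factors (made up for illustration)
x <- data.frame(
  num1 = c(1.0, 1.2, 5.0, 5.3),
  num2 = c(0.5, 0.4, 3.0, 3.1),
  cat1 = factor(c("A", "A", "B", "B")),
  cat2 = factor(c("u", "v", "v", "v"))
)
lambda <- 2
is_num <- vapply(x, is.numeric, logical(1))

# Hypothetical helper: distance between one observation and one prototype,
# both given as one-row data frames.
# Assumption: squared Euclidean distance for the numeric part.
mixed_dist <- function(obs, proto, lambda) {
  d_num <- sum((unlist(obs[is_num]) - unlist(proto[is_num]))^2)
  d_cat <- sum(vapply(obs[!is_num], as.character, character(1)) !=
               vapply(proto[!is_num], as.character, character(1)))
  d_num + lambda * d_cat
}

# Use observations 1 and 3 as initial prototypes
protos <- x[c(1, 3), ]

# Assignment step: distances of all observations (rows) to all prototypes (columns)
dists <- sapply(seq_len(nrow(protos)), function(j)
  sapply(seq_len(nrow(x)), function(i) mixed_dist(x[i, ], protos[j, ], lambda)))
cluster <- apply(dists, 1, which.min)

# Update step: means for numeric variables, modes for factors
stat_mode <- function(v) names(which.max(table(v)))  # ties: first level wins
new_protos <- do.call(rbind, lapply(split(x, cluster), function(g) {
  as.data.frame(lapply(g, function(col)
    if (is.numeric(col)) mean(col) else stat_mode(col)))
}))

print(cluster)
print(new_protos)
```

Repeating the two steps until the assignments no longer change (or until iter.max is reached) gives the k-means-like iteration described above.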
If na.rm = FALSE, variables with missing values are ignored for each observation
(i.e. only the remaining variables are considered for distance computation).
As a consequence, for observations with missing values the effective weighting of the variables may differ from the one specified
by lambda. Note further that distances to the prototypes will typically be smaller for these observations, as they are based
on fewer variables.
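The effect of ignoring missing variables can be illustrated with a small base R sketch. `mixed_dist_na` is a hypothetical helper written for this illustration, not a function of the package, and the numeric part is again assumed to be the squared Euclidean distance.

```r
# Hypothetical helper: mixed distance that skips variables which are NA
# in the observation. Assumption: squared Euclidean numeric part.
mixed_dist_na <- function(obs_num, obs_cat, proto_num, proto_cat, lambda) {
  keep_num <- !is.na(obs_num)
  keep_cat <- !is.na(obs_cat)
  d_num <- sum((obs_num[keep_num] - proto_num[keep_num])^2)
  d_cat <- sum(obs_cat[keep_cat] != proto_cat[keep_cat])
  d_num + lambda * d_cat
}

proto_num <- c(0, 0)
proto_cat <- c("A", "u")

# Same observation, once complete and once with the second numeric value missing
complete <- mixed_dist_na(c(1, 2),  c("B", "u"), proto_num, proto_cat, lambda = 2)
partial  <- mixed_dist_na(c(1, NA), c("B", "u"), proto_num, proto_cat, lambda = 2)

print(c(complete = complete, partial = partial))
```

In this toy case the distance drops from 7 to 3, and the share contributed by the numeric variables falls from 5/7 to 1/3, which is the shift in weighting mentioned above.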
A kmeans-like object of class kproto:
cluster 
Vector of cluster memberships. 
centers 
Data frame of cluster prototypes. 
lambda 
Distance parameter lambda. 
size 
Vector of cluster sizes. 
withinss 
Vector of within-cluster distances for each cluster, i.e. summed distances of all observations belonging to a cluster to their respective prototype. 
tot.withinss 
Target function: sum of all observations' distances to their corresponding cluster prototype. 
dists 
Matrix with distances of observations to all cluster prototypes. 
iter 
Prespecified maximum number of iterations. 
trace 
List with two elements (vectors) tracing the iteration process:
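As a rough illustration of how the components cluster, size, withinss, tot.withinss and dists relate to each other, the following base R sketch derives them from a made-up distance matrix. It assumes that each observation belongs to its nearest prototype, as in the description of the algorithm; it does not call the package.

```r
# Toy distance matrix: 5 observations (rows) x 2 prototypes (columns)
dists <- matrix(c(0.2, 3.1,
                  0.5, 2.7,
                  4.0, 0.3,
                  3.6, 0.1,
                  0.9, 1.4), ncol = 2, byrow = TRUE)

# Membership: index of the nearest prototype for each observation
cluster <- apply(dists, 1, which.min)

# Distance of each observation to its own prototype
own_dist <- dists[cbind(seq_len(nrow(dists)), cluster)]

size         <- as.vector(table(cluster))               # observations per cluster
withinss     <- as.vector(tapply(own_dist, cluster, sum))  # summed distances per cluster
tot.withinss <- sum(withinss)                           # value of the target function

print(list(cluster = cluster, size = size,
           withinss = withinss, tot.withinss = tot.withinss))
```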

Szepannek, G. (2018): clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal 10/2, 200-208, doi: 10.32614/RJ-2018-048.
Huang, Z. (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Variables, Data Mining and Knowledge Discovery 2, 283-304.
# generate toy data with factors and numerics
n   <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)

# two factors whose dominant level differs between the first and second half
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)

x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)

# two numeric variables whose mean alternates between -muk and muk
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))

x <- data.frame(x1,x2,x3,x4)

# apply k-prototypes
kpres <- kproto(x, 4)
clprofiles(kpres, x)

# in real world clusters are often not as clear cut
# by variation of lambda the emphasis is shifted towards factor / numeric variables
kpres <- kproto(x, 2)
clprofiles(kpres, x)

kpres <- kproto(x, 2, lambda = 0.1)
clprofiles(kpres, x)

kpres <- kproto(x, 2, lambda = 25)
clprofiles(kpres, x)