kproto {clustMixType} | R Documentation |
k-Prototypes Clustering
Description
Computes k-prototypes clustering for mixed-type data.
Usage
kproto(x, ...)
## Default S3 method:
kproto(
x,
k,
lambda = NULL,
type = "huang",
iter.max = 100,
nstart = 1,
na.rm = "yes",
keep.data = TRUE,
verbose = TRUE,
init = NULL,
p_nstart.m = 0.9,
...
)
Arguments
x |
Data frame with both numeric and factor variables (ordered factors are also possible). |
... |
Currently not used. |
k |
Either the number of clusters, a vector specifying indices of initial prototypes, or a data frame of
prototypes of the same columns as |
lambda |
Parameter > 0 to trade off between Euclidean distance of numeric variables and simple matching
coefficient between categorical variables (if |
type |
Character, to specify the distance for clustering. Either |
iter.max |
Numeric; maximum number of iterations if the algorithm has not converged earlier. |
nstart |
Numeric; If > 1 repetitive computations with random initializations are computed and the result with
minimum |
na.rm |
Character, either |
keep.data |
Logical, whether the original data should be included in the returned object. |
verbose |
Logical, whether additional information about the process should be printed.
Caution: For |
init |
Character, to specify the initialization strategy. Either |
p_nstart.m |
Numeric, probability(=0.9 is default) for |
Details
Like k-means, the k-prototypes algorithm iteratively recomputes cluster prototypes and reassigns
clusters. With type = "huang", clusters are assigned using the distance
d(x,y) = d_{euclid}(x,y) + \lambda d_{simple\,matching}(x,y).
Cluster prototypes are computed as cluster means for numeric variables and modes for factors
(cf. Huang, 1998). Ordered factor variables are treated as categorical variables.
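The distance and prototype rules above can be sketched in a few lines of base R. This is an illustrative sketch only, not the package's internal code: the helper names huang_dist and huang_prototype are hypothetical, and the numeric part is assumed to be a squared Euclidean distance.

```r
# Hypothetical helpers for illustration; they are not part of clustMixType.

# Distance between one observation and one prototype (one-row data frames).
huang_dist <- function(obs, proto, lambda) {
  is_num <- vapply(obs, is.numeric, logical(1))
  # Numeric part: squared Euclidean distance (an assumption of this sketch)
  d_num <- sum((unlist(obs[is_num]) - unlist(proto[is_num]))^2)
  # Categorical part: simple matching, i.e. the number of mismatching levels
  d_cat <- sum(vapply(names(obs)[!is_num], function(v) {
    as.character(obs[[v]]) != as.character(proto[[v]])
  }, logical(1)))
  d_num + lambda * d_cat
}

# Prototype of a cluster: mean for numeric variables, mode for factors.
huang_prototype <- function(members) {
  as.data.frame(lapply(members, function(col) {
    if (is.numeric(col)) {
      mean(col)
    } else {
      # which.max() returns the first maximum, so ties go to the first level
      factor(names(which.max(table(col))), levels = levels(col))
    }
  }))
}

obs   <- data.frame(x1 = factor("A"), x2 = factor("B"), x3 = 1.0, x4 = -0.5)
proto <- data.frame(x1 = factor("A"), x2 = factor("A"), x3 = 1.5, x4 = 0.5)
d <- huang_dist(obs, proto, lambda = 2)

members <- data.frame(g = factor(c("A", "A", "B")), v = c(1, 2, 3))
p <- huang_prototype(members)
```

With lambda = 2, the single factor mismatch contributes 2 to d, while the two numeric differences contribute 0.25 + 1, so larger values of lambda shift the emphasis towards the factor variables.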
For type = "gower", range-normalized absolute distances from the cluster median are computed for
the numeric variables (and for the ranks of the ordered factors, respectively). For factors, simple matching
distance is used as in the original k-prototypes algorithm. The prototypes are given by the median for
numeric variables, the mode for factors and, for ordered factors, the level whose rank is closest to the
median rank of the corresponding cluster (cf. Szepannek et al., 2024).
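As a rough illustration of the Gower-type distance described above, the following base R sketch sums range-normalized absolute differences for numeric variables and ordered factors and simple matching for unordered factors. The helper gower_dist is hypothetical, and using the integer level codes as ranks is an assumption of this sketch, not necessarily what the package does internally.

```r
# Hypothetical helper for illustration; it is not part of clustMixType.
# obs and proto are one-row data frames sharing the columns (and levels) of data.
gower_dist <- function(obs, proto, data) {
  # Ordered factors enter via their integer level codes (assumed ranks)
  to_num <- function(z) if (is.ordered(z)) as.integer(z) else z
  d <- vapply(names(obs), function(v) {
    col <- data[[v]]
    if (is.numeric(col) || is.ordered(col)) {
      rng <- diff(range(to_num(col)))   # range used for normalization
      abs(to_num(obs[[v]]) - to_num(proto[[v]])) / rng
    } else {
      # Unordered factors: simple matching (0 if equal, 1 otherwise)
      as.numeric(as.character(obs[[v]]) != as.character(proto[[v]]))
    }
  }, numeric(1))
  sum(d)
}

dat <- data.frame(
  size  = factor(c("S", "M", "L", "M"), levels = c("S", "M", "L"), ordered = TRUE),
  color = factor(c("red", "blue", "red", "red")),
  len   = c(1, 3, 5, 2)
)

# Rows of dat are used as stand-ins for an observation and a prototype
d_gower <- gower_dist(dat[1, ], dat[2, ], dat)

# Prototype level of an ordered factor: the level closest to the median rank
med_rank   <- median(as.integer(dat$size))
proto_size <- levels(dat$size)[which.min(abs(seq_along(levels(dat$size)) - med_rank))]
```

Because every variable contributes a value between 0 and 1, no variable dominates the sum merely because of its scale.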
In case of na.rm = FALSE: for each observation, variables with missing values are ignored (i.e. only the
remaining variables are considered for distance computation). As a consequence, for observations with
missing values the weighting of the variables may differ from the one specified by lambda. For
these observations, distances to the prototypes will typically be smaller, as they are based on fewer variables.
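The effect of ignoring missing values can be made concrete with a small base R sketch. The helper dist_ignore_na is hypothetical, uses the "huang"-type distance and assumes a squared Euclidean numeric part; it only illustrates why distances shrink and the relative weighting changes.

```r
# Hypothetical helper for illustration; it is not part of clustMixType.
# Variables that are missing in obs are skipped in the distance.
dist_ignore_na <- function(obs, proto, lambda) {
  total <- 0
  for (v in names(obs)) {
    if (is.na(obs[[v]])) next   # ignore variables with a missing value
    if (is.numeric(obs[[v]])) {
      # Numeric part: squared difference (an assumption of this sketch)
      total <- total + (obs[[v]] - proto[[v]])^2
    } else {
      # Categorical part: lambda times the simple matching indicator
      total <- total + lambda * (as.character(obs[[v]]) != as.character(proto[[v]]))
    }
  }
  total
}

proto    <- data.frame(x1 = factor("A", levels = c("A", "B")), x3 = 1.5, x4 = 0.5)
complete <- data.frame(x1 = factor("B", levels = c("A", "B")), x3 = 1.0, x4 = -0.5)
with_na  <- complete
with_na$x4 <- NA_real_   # same observation with x4 missing

d_complete <- dist_ignore_na(complete, proto, lambda = 2)
d_na       <- dist_ignore_na(with_na, proto, lambda = 2)
```

Here the factor mismatch accounts for 2 of 3.25 in the complete case but for 2 of 2.25 once x4 is missing, so the categorical part carries more relative weight than lambda alone suggests.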
The type argument also accepts the input "standard", but this name is deprecated and
has been renamed to "huang". Please use "huang" instead.
Value
kmeans
like object of class kproto
:
cluster |
Vector of cluster memberships. |
centers |
Data frame of cluster prototypes. |
lambda |
Distance parameter lambda. |
size |
Vector of cluster sizes. |
withinss |
Vector of within cluster distances for each cluster, i.e. summed distances of all observations belonging to a cluster to their respective prototype. |
tot.withinss |
Target function: sum of all observations' distances to their corresponding cluster prototype. |
dists |
Matrix with distances of observations to all cluster prototypes. |
iter |
Prespecified maximum number of iterations. |
trace |
List with two elements (vectors) tracing the iteration process:
|
inits |
Initial prototypes determined by specified initialization strategy, if init is either 'nbh.dens' or 'sel.cen'. |
nstart.m |
Only for 'init = nstart_m': determined number of randomly chosen sets. |
data |
If 'keep.data = TRUE', the original data are added to the output list. |
type |
Type argument of the function call. |
stdization |
Only returned for |
Author(s)
References
Szepannek, G. (2018): clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal 10/2, 200-208, doi:10.32614/RJ-2018-048.
Aschenbruck, R., Szepannek, G., Wilhelm, A. (2022): Imputation Strategies for Clustering Mixed-Type Data with Missing Values, Journal of Classification, doi:10.1007/s00357-022-09422-y.
Szepannek, G., Aschenbruck, R., Wilhelm, A. (2024): Clustering Large Mixed-Type Data with Ordinal Variables, Advances in Data Analysis and Classification, doi:10.1007/s11634-024-00595-5.
Huang, Z. (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Variables, Data Mining and Knowledge Discovery 2, 283-304.
Examples
# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1,x2,x3,x4)
# apply k-prototypes
kpres <- kproto(x, 4)
clprofiles(kpres, x)
# in the real world, clusters are often not as clear-cut
# by varying lambda, the emphasis is shifted towards factor / numeric variables
kpres <- kproto(x, 2)
clprofiles(kpres, x)
kpres <- kproto(x, 2, lambda = 0.1)
clprofiles(kpres, x)
kpres <- kproto(x, 2, lambda = 25)
clprofiles(kpres, x)
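The components listed under Value can be inspected directly on the returned object. The following is a minimal sketch, assuming the clustMixType package is installed; the toy data, the seed and the chosen values of lambda and nstart are illustrative only.

```r
library(clustMixType)

set.seed(42)   # arbitrary seed, only to make the toy data reproducible
n <- 50
x <- data.frame(
  f   = factor(c(sample(c("A", "B"), n, replace = TRUE, prob = c(0.9, 0.1)),
                 sample(c("A", "B"), n, replace = TRUE, prob = c(0.1, 0.9)))),
  num = c(rnorm(n, mean = -1.5), rnorm(n, mean = 1.5))
)

# several random initializations, without progress output
kp <- kproto(x, k = 2, lambda = 1, nstart = 3, verbose = FALSE)

kp$size           # cluster sizes
kp$centers        # cluster prototypes
kp$tot.withinss   # value of the target function
table(kp$cluster) # cluster memberships
```

Setting verbose = FALSE keeps the output limited to the inspected components, which is convenient when kproto is called repeatedly.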