sparsewkm {vimpclust} | R Documentation |
Sparse weighted k-means
Description
This function performs sparse weighted k-means on a set of observations described by numerical and/or categorical variables. It generalizes the sparse clustering algorithm introduced in Witten & Tibshirani (2010) to any type of data (numerical, categorical or a mixture of both). The weight of each variable indicates its importance in the clustering process; non-discriminant variables receive a weight of 0, so that only the discriminant variables are selected.
Usage
sparsewkm(
X,
centers,
lambda = NULL,
nlambda = 20,
nstart = 10,
itermaxw = 20,
itermaxkm = 10,
renamelevel = TRUE,
verbose = 1,
epsilonw = 1e-04
)
Arguments
X |
a dataframe of dimension |
centers |
an integer representing the number of clusters. |
lambda |
a vector of numerical values (or a single value) providing
a grid of values for the regularization parameter. If NULL (by default), the function computes its
own lambda sequence of length |
nlambda |
an integer indicating the number of values for the regularization parameter.
By default, nlambda = 20. |
nstart |
an integer representing the number of random starts in the k-means algorithm.
By default, nstart = 10. |
itermaxw |
an integer indicating the maximum number of iterations for the inside
loop over the weights |
itermaxkm |
an integer representing the maximum number of iterations in the k-means
algorithm. By default, itermaxkm = 10. |
renamelevel |
a boolean. If TRUE (default option), each level of a categorical variable
is renamed as |
verbose |
an integer value. If |
epsilonw |
a positive numerical value. It provides the precision of the stopping
criterion over |
Details
Sparse weighted k-means performs clustering on mixed data (numerical and/or categorical), and automatically selects the most discriminant variables by setting to zero the weights of the non-discriminant ones.
The mixed data is first preprocessed: numerical variables are scaled to zero mean and unit variance; categorical variables are transformed into dummy variables, then scaled, in mean and variance, with respect to the relative frequency of each level.
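The preprocessing step can be illustrated with a minimal base R sketch. The toy data frame and the exact scaling of the dummy variables (centering each indicator column by the level frequency and dividing by its square root) are assumptions made for illustration; the package's own recoding may differ in detail.

```r
# Toy mixed data frame (hypothetical values, for illustration only)
df <- data.frame(
  age = c(23, 35, 47, 52, 61, 29),
  sex = factor(c("f", "m", "m", "f", "m", "f"))
)

# Numerical variables: zero mean, unit variance
num <- scale(df[, sapply(df, is.numeric), drop = FALSE])

# Categorical variables: one indicator (dummy) column per level
ind <- model.matrix(~ sex - 1, data = df)
freq <- colMeans(ind)  # relative frequency of each level

# Assumed scaling: centre by the frequency, divide by its square root
cat_scaled <- sweep(sweep(ind, 2, freq, "-"), 2, sqrt(freq), "/")

# Transformed matrix and a group index: the two levels of `sex`
# share the same index, the numerical variable has its own
X_transformed <- cbind(num, cat_scaled)
index <- c(1, 2, 2)
```

The `index` vector plays the same role as the `index` element described in the Value section: it records which columns of the transformed matrix belong to the same original variable.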
The algorithm is based on the optimization of a cost function, namely the weighted between-class variance penalized by a group L1-norm. The groups are implicitly defined: each numerical variable constitutes its own group, and the levels associated with one categorical variable constitute a group. The importance of the penalty term may be adjusted through the regularization parameter lambda.
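As a rough illustration of this cost function, the sketch below computes the between-class sum of squares of each column for a given partition, then evaluates a penalized objective. The helper names `bss_per_feature` and `penalized_objective` are hypothetical, and the penalty form (each group's Euclidean norm weighted by the square root of the group size) is an assumption borrowed from the usual group-lasso convention, not taken from the package source.

```r
# Between-class sum of squares of each column of X for a partition `cluster`
# (cluster labels are assumed to be integers 1..K)
bss_per_feature <- function(X, cluster) {
  X <- scale(X, center = TRUE, scale = FALSE)  # global mean becomes 0
  sizes <- as.vector(table(cluster))           # cluster sizes
  centers <- rowsum(X, cluster) / sizes        # cluster means, one row per cluster
  colSums(sizes * centers^2)
}

# Weighted between-class variance minus an assumed group penalty
penalized_objective <- function(w, bss, index, lambda) {
  group_norms <- tapply(w, index, function(wg) {
    sqrt(length(wg)) * sqrt(sum(wg^2))
  })
  sum(w * bss) - lambda * sum(group_norms)
}

# Toy example: the first column separates the two clusters, the second does not
X <- cbind(c(0, 0, 10, 10), c(1, 2, 1, 2))
cluster <- c(1, 1, 2, 2)
bss <- bss_per_feature(X, cluster)
obj <- penalized_objective(w = c(1, 0), bss = bss, index = c(1, 2), lambda = 2)
```

In this toy case all of the between-class variance is carried by the first column, which is the kind of variable the weights are meant to favour.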
The output of the algorithm is twofold: one gets a partitioning of the data set and a vector of weights associated with the variables. Some of the weights are equal to 0, meaning that the associated variables do not participate in the clustering process. If lambda is equal to zero, no penalty is applied to the weighted between-class variance in the optimization procedure. The larger the value of lambda, the larger the penalty term and the number of variables with null weights. Furthermore, the weights associated with each level of a categorical variable are also computed.
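The effect of lambda on the number of null weights can be sketched with a group soft-thresholding step, a standard device for group penalties. The function `group_soft_threshold` is a hypothetical helper written for this illustration; whether the package uses exactly this update, and the square-root-of-group-size scaling, are assumptions.

```r
# Group soft-thresholding of a vector b of per-column contributions,
# followed by normalization to unit Euclidean norm
group_soft_threshold <- function(b, index, lambda) {
  w <- numeric(length(b))
  for (g in unique(index)) {
    idx <- which(index == g)
    norm_g <- sqrt(sum(b[idx]^2))
    # Assumed penalty scaling: square root of the group size
    shrink <- if (norm_g > 0) {
      max(0, 1 - lambda * sqrt(length(idx)) / norm_g)
    } else {
      0
    }
    w[idx] <- shrink * b[idx]
  }
  nrm <- sqrt(sum(w^2))
  if (nrm > 0) w / nrm else w
}

# Toy contributions: two numerical variables and one two-level factor
b <- c(4, 1, 0.1, 0.1)
index <- c(1, 2, 3, 3)
lambdas <- c(0, 0.5, 2)

# Number of null weights for each value of lambda
n_zero <- sapply(lambdas, function(l) {
  sum(group_soft_threshold(b, index, l) == 0)
})
```

With lambda equal to zero the weights are simply the normalized contributions; as lambda grows, whole groups are set to zero together, so both levels of the factor are dropped at once.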
Since it is difficult to choose the regularization parameter lambda without prior knowledge, the function automatically builds a grid of parameters and finds a partition and a vector of weights for each value of the grid.
Note also that the columns of the data frame X must be of class factor for categorical variables.
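If categorical variables are stored as character columns, they can be converted before calling the function. The following is a small base R sketch on a hypothetical data frame:

```r
# Hypothetical data frame with a character column
df <- data.frame(
  height = c(1.62, 1.75, 1.80),
  smoker = c("yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leave the others untouched
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

str(df)
```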
Value
lambda |
a numerical vector containing the regularization parameters (a grid of values). |
W |
a |
Wm |
a |
cluster |
a |
sel.init.feat |
a numerical vector of the same length as |
sel.trans.feat |
a numerical vector of the same length as |
X.transformed |
a matrix of size |
index |
a numerical vector indexing the variables and allowing to group together the levels of a categorical variable. |
bss.per.feature |
a matrix of size |
References
Witten, D. M., & Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490), 713-726.
Chavent, M., Lacaille, J., Mourer, A., & Olteanu, M. (2020). Sparse k-means for mixed data via group-sparse clustering. ESANN proceedings.
See Also
plot.spwkm, info_clust, groupsparsewkm, recodmix
Examples
library(vimpclust)
data(HDdata)
out <- sparsewkm(X = HDdata[,-14], centers = 2)
# grid of automatically selected regularization parameters
out$lambda
k <- 10
# weights of the variables for the k-th regularization parameter
out$W[,k]
# weights of the numerical variables and of the levels
out$Wm[,k]
# partitioning obtained for the k-th regularization parameter
out$cluster[,k]
# number of selected variables for each regularization parameter
out$sel.init.feat
# between-class variance on each variable
out$bss.per.feature[,k]
# total between-class variance (sum over all variables)
sum(out$bss.per.feature[,k])