groupsparsewkm {vimpclust} | R Documentation |
Group-sparse weighted k-means
Description
This function performs group-sparse weighted k-means on a set of observations described by numerical variables organized in groups. It generalizes the sparse clustering algorithm introduced by Witten & Tibshirani (2010) to groups of variables. The algorithm clusters the observations; the groups of variables are assumed to be known in advance. It computes a weight for each group of variables, indicating the importance of that group in the clustering process.
Usage
groupsparsewkm(
X,
centers,
lambda = NULL,
nlambda = 20,
index = 1:ncol(X),
sizegroup = TRUE,
nstart = 10,
itermaxw = 20,
itermaxkm = 10,
scaling = TRUE,
verbose = 1,
epsilonw = 1e-04
)
Arguments
X |
a numerical matrix or a dataframe of dimension |
centers |
an integer representing the number of clusters. |
lambda |
a vector of numerical values (or a single value) providing
a grid of values for the regularization parameter. If NULL (by default), the function computes its
own lambda sequence of length |
nlambda |
an integer indicating the number of values for the regularization parameter.
By default, |
index |
a vector of integers of size |
sizegroup |
a boolean. If TRUE, the group sizes (number of variables in each group) are taken into account in the penalty term (see details).
By default, |
nstart |
an integer representing the number of random starts in the k-means algorithm.
By default, |
itermaxw |
an integer indicating the maximum number of iterations for the inside
loop over the weights |
itermaxkm |
an integer representing the maximum number of iterations in the k-means
algorithm. By default, |
scaling |
a boolean. If TRUE, variables are scaled to zero mean and unit variance. By default, |
verbose |
an integer value. If |
epsilonw |
a positive numerical value. It provides the precision of the stopping criterion over |
Details
Group-sparse weighted k-means performs clustering on data described by numerical variables that have been partitioned into groups beforehand, and automatically selects the most discriminant groups by setting the weights of the non-discriminant ones to zero.
The algorithm optimizes a cost function, the weighted between-class variance penalized by a group L1-norm. The groups must be defined beforehand through
expert knowledge. If there is no group structure (each group contains a single variable), the algorithm reduces to the sparse weighted k-means introduced in Witten & Tibshirani (2010).
The penalty term may take into account the size of the groups by setting sizegroup=TRUE (see Chavent et al. (2020) for further details on the mathematical expression of the optimized criterion). The importance of the penalty term may be adjusted through the regularization parameter lambda. If lambda=0, no penalty is applied to the weighted between-class variance. The larger lambda is, the larger the penalty term and the larger the number of groups with zero weights.
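For orientation, the penalized criterion described above can be written schematically as follows. This is only a sketch under assumptions: the notation (w for the variable weights, w^(g) for the sub-vector of weights in group g, B_j for the between-class variance of variable j, p_g for the size of group g, alpha_g for the group-size factor) is introduced here for illustration, and the constraints are borrowed from the form used by Witten & Tibshirani (2010). The exact expression should be checked against Chavent et al. (2020).

```latex
\max_{\mathcal{C},\, w}\;
  \sum_{j=1}^{p} w_j \, B_j(\mathcal{C})
  \;-\; \lambda \sum_{g=1}^{G} \alpha_g \,\bigl\| w^{(g)} \bigr\|_2
\quad \text{subject to} \quad
  \| w \|_2 \le 1, \qquad w_j \ge 0 \;\; (j = 1, \dots, p),
\qquad
\alpha_g =
  \begin{cases}
    \sqrt{p_g} & \text{if sizegroup = TRUE (assumed)} \\
    1          & \text{otherwise}
  \end{cases}
```

In this form the penalty is a sum of Euclidean norms over groups, which is what drives all the weights of a group to zero together as lambda grows.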
The output of the algorithm is threefold: a partitioning of the data, a vector of weights associated with the groups, and a vector of weights associated with the variables. A weight equal to zero means that the associated variable or group does not participate in the clustering process.
Since it is difficult to choose the regularization parameter lambda without prior knowledge, the function automatically builds a grid of parameters and finds the partitioning and the vectors of weights associated with each value in the grid.
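One way to read the results over the whole grid is to count, for each value of lambda, how many groups keep a non-zero weight. The sketch below assumes that the vimpclust package is installed and that the columns of out$Wg correspond to the values in out$lambda, as the indexing in the Examples section suggests. The name n_active is introduced here for illustration only.

```r
# Sketch only: runs if the vimpclust package is available, skipped otherwise.
if (requireNamespace("vimpclust", quietly = TRUE)) {
  set.seed(42)
  # Sepal variables in group 1, petal variables in group 2
  index <- c(1, 1, 2, 2)
  out <- vimpclust::groupsparsewkm(X = iris[, -5], centers = 3, index = index)

  # Number of groups with a non-zero weight for each lambda in the grid
  # (n_active is a hypothetical helper name, not part of the package)
  n_active <- colSums(out$Wg > 0)
  print(data.frame(lambda = out$lambda, n_active = n_active))
}
```

A table of this kind can help pick a value of lambda that keeps the desired number of groups.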
Note that when the regularization parameter is equal to 0 (no penalty applied), the output is different from that of a regular k-means, since the optimized criterion is a weighted between-class variance and not the between-class variance only.
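The difference with regular k-means can be illustrated in base R. The sketch below is not the package implementation. It assumes, as in Witten & Tibshirani (2010), that the weights are non-negative with L2-norm at most 1; under that assumption and with no penalty, the weights maximizing the weighted between-class variance for a fixed partition are proportional to the per-variable between-class sums of squares, so they are generally not uniform.

```r
# Base-R sketch; not the vimpclust implementation.
set.seed(1)
X <- scale(iris[, -5])                 # zero mean, unit variance
km <- kmeans(X, centers = 3, nstart = 10)

# Between-class sum of squares of each variable for the k-means partition
n_k <- as.vector(table(km$cluster))    # cluster sizes, ordered by cluster label
bss <- colSums(n_k * sweep(km$centers, 2, colMeans(X))^2)

# Sanity check: the per-variable values add up to the total reported by kmeans
stopifnot(isTRUE(all.equal(sum(bss), km$betweenss)))

# Assumed constraint: w >= 0 and ||w||_2 <= 1. With no penalty, the optimal
# weights for this fixed partition are proportional to bss.
w <- bss / sqrt(sum(bss^2))
print(round(w, 3))
```

Because these weights rescale the variables before the next clustering step, the resulting partition may differ from the one given by unweighted k-means.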
Value
lambda |
a numerical vector containing the regularization parameters (a grid of values). |
W |
a |
Wg |
a |
cluster |
a |
sel.feat |
a numerical vector of the same length as |
sel.groups |
a numerical vector of the same length as |
Z |
a matrix of size |
bss.per.feature |
a matrix of size |
References
Witten, D. M., & Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490), 713-726.
Chavent, M., Lacaille, J., Mourer, A., & Olteanu, M. (2020). Sparse k-means for mixed data via group-sparse clustering. ESANN proceedings.
See Also
Examples
data(iris)
# define two groups of variables:
# "Sepal.Length" and "Sepal.Width" in group 1
# "Petal.Length" and "Petal.Width" in group 2
index <- c(1, 1, 2, 2)
# group-sparse k-means
out <- groupsparsewkm(X = iris[,-5], centers = 3, index = index)
# grid of regularization parameters
out$lambda
k <- 10
# weights of the variables for the k-th regularization parameter
out$W[,k]
# weights of the groups for the k-th regularization parameter
out$Wg[,k]
# partition obtained for the k-th regularization parameter
out$cluster[,k]
# between-class variance on each variable
out$bss.per.feature[,k]
# between-class variance
sum(out$bss.per.feature[,k])/length(index)
# one variable per group (equivalent to sparse k-means)
index <- 1:4 # default option in groupsparsewkm
# sparse k-means
out <- groupsparsewkm(X = iris[,-5], centers = 3, index = index)
# or
out <- groupsparsewkm(X = iris[,-5], centers = 3)
# group weights and variable weights are identical in this case
out$Wg
out$W