R: Group-sparse weighted k-means

groupsparsewkm {vimpclust}

R Documentation

Group-sparse weighted k-means

Description

This function performs group-sparse weighted k-means on a set of observations described by numerical variables organized in groups. It generalizes the sparse clustering algorithm introduced by Witten & Tibshirani (2010) to groups of variables. The groups are assumed to be known in advance. The algorithm clusters the observations and computes a weight for each group of variables, indicating the importance of that group in the clustering process.

Usage

groupsparsewkm(
  X,
  centers,
  lambda = NULL,
  nlambda = 20,
  index = 1:ncol(X),
  sizegroup = TRUE,
  nstart = 10,
  itermaxw = 20,
  itermaxkm = 10,
  scaling = TRUE,
  verbose = 1,
  epsilonw = 1e-04
)

Arguments

`X`	a numerical matrix or a dataframe of dimension `n` (observations) by `p` (variables).
`centers`	an integer representing the number of clusters.
`lambda`	a vector of numerical values (or a single value) providing a grid of values for the regularization parameter. If NULL (by default), the function computes its own lambda sequence of length `nlambda` (see details).
`nlambda`	an integer indicating the number of values for the regularization parameter. By default, `nlambda=20`.
`index`	a vector of integers of size `p` providing the group membership for each variable. By default, `index=1:ncol(X)`, i.e. each variable forms its own group of size 1 (no group structure).
`sizegroup`	a boolean. If TRUE, the group sizes (number of variables in each group) are taken into account in the penalty term (see details). By default, `sizegroup=TRUE`.
`nstart`	an integer representing the number of random starts in the k-means algorithm. By default, `nstart=10`.
`itermaxw`	an integer indicating the maximum number of iterations for the inner loop over the weights `w`. By default, `itermaxw=20`.
`itermaxkm`	an integer representing the maximum number of iterations in the k-means algorithm. By default, `itermaxkm=10`.
`scaling`	a boolean. If TRUE, variables are scaled to zero mean and unit variance. By default, `scaling=TRUE`.
`verbose`	an integer value. If `verbose=0`, the function stays silent; if `verbose=1` (the default), it prints whether the stopping criterion over the weights `w` is satisfied.
`epsilonw`	a positive numerical value. It provides the precision of the stopping criterion over `w`. By default, `epsilonw=1e-04`.
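
The `index` and `scaling` arguments can be illustrated with base R alone. The sketch below assumes the built-in `iris` data and a hypothetical sepal/petal grouping. It uses `scale()`, which standardizes with the sample standard deviation (denominator `n - 1`); whether the package uses the same denominator internally is not stated in this page.

```r
# Hypothetical grouping of the four iris measurements: sepal vs. petal.
X <- iris[, -5]
index <- c(1, 1, 2, 2)

# One group label per column is required.
stopifnot(length(index) == ncol(X))

# Variables in each group and the group sizes.
groups <- split(colnames(X), index)
group_sizes <- table(index)

# Centre to zero mean and scale to unit variance, as scaling = TRUE describes.
Z <- scale(X)
round(colMeans(Z), 10)
apply(Z, 2, sd)
```

Checking `groups` before calling the function is a cheap way to confirm that each variable landed in the intended group.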

Details

Group-sparse weighted k-means performs clustering on data described by numerical variables partitioned into groups beforehand, and automatically selects the most discriminant groups by setting the weights of the non-discriminant ones to zero.

The algorithm is based on the optimization of a cost function, namely the weighted between-class variance penalized by a group L1-norm. The groups must be defined beforehand through expert knowledge. If there is no group structure (each group contains one variable only), the algorithm reduces to the sparse weighted k-means introduced in Witten & Tibshirani (2010). The penalty term may take into account the size of the groups by setting `sizegroup=TRUE` (see Chavent et al. (2020) for further details on the mathematical expression of the optimized criterion). The importance of the penalty term may be adjusted through the regularization parameter `lambda`. If `lambda=0`, no penalty is applied to the weighted between-class variance. The larger `lambda` is, the larger the penalty term and the more groups receive a zero weight.
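
To make the criterion more concrete, the sketch below computes a per-variable between-class sum of squares for a fixed partition and combines it with a weight vector. It uses only base R. The helper `bss_per_variable`, the equal weights, and the absence of any penalty are illustrative assumptions, and the package's own normalization of the between-class variance may differ.

```r
# Between-class sum of squares of each variable for a given partition.
# Hypothetical helper, not part of vimpclust.
bss_per_variable <- function(Z, cluster) {
  Z <- as.matrix(Z)
  grand_mean <- colMeans(Z)
  bss <- numeric(ncol(Z))
  for (k in unique(cluster)) {
    rows <- cluster == k
    centre <- colMeans(Z[rows, , drop = FALSE])
    bss <- bss + sum(rows) * (centre - grand_mean)^2
  }
  names(bss) <- colnames(Z)
  bss
}

Z <- scale(iris[, -5])
set.seed(1)
km <- kmeans(Z, centers = 3, nstart = 10)

b <- bss_per_variable(Z, km$cluster)

# Equal weights with unit Euclidean norm (an assumption for illustration).
w <- rep(1, ncol(Z)) / sqrt(ncol(Z))

# Unpenalized weighted between-class criterion for this partition.
sum(w * b)
```

Summing `b` over the variables recovers the total between-class sum of squares reported by `kmeans` as `betweenss`, which gives a quick sanity check on the helper.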

The output of the algorithm is threefold: a partitioning of the data, a vector of weights associated with the groups, and a vector of weights associated with the variables. A weight equal to zero means that the corresponding variable or group does not participate in the clustering process.

Since it is difficult to choose the regularization parameter `lambda` without prior knowledge, the function automatically builds a grid of values and computes the partitioning and the weight vectors for each value in the grid.

Note that when the regularization parameter is equal to 0 (no penalty applied), the output is different from that of a regular k-means, since the optimized criterion is a weighted between-class variance and not the between-class variance only.
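
The difference from ordinary k-means can be seen in a small base R sketch. It assumes, following the formulation in Witten & Tibshirani (2010), that the weights are non-negative with Euclidean norm at most 1, so that without a penalty the optimal weights for a fixed partition are proportional to the per-variable between-class sums of squares, and that the partition step is a k-means on columns scaled by the square roots of the weights. The function `unpenalized_wkm` is hypothetical and is not the package's implementation.

```r
# Hypothetical sketch of the unpenalized (lambda = 0) alternation.
unpenalized_wkm <- function(X, centers, iter = 5, nstart = 10) {
  Z <- scale(as.matrix(X))
  p <- ncol(Z)
  w <- rep(1 / sqrt(p), p)            # start from equal weights
  grand_mean <- colMeans(Z)
  for (i in seq_len(iter)) {
    # Partition step: k-means on columns scaled by sqrt(w).
    km <- kmeans(sweep(Z, 2, sqrt(w), "*"), centers = centers, nstart = nstart)
    # Weight step: between-class sum of squares per variable, on unweighted Z.
    b <- numeric(p)
    for (k in seq_len(centers)) {
      rows <- km$cluster == k
      b <- b + sum(rows) * (colMeans(Z[rows, , drop = FALSE]) - grand_mean)^2
    }
    w <- b / sqrt(sum(b^2))           # rescale to unit Euclidean norm
  }
  list(cluster = km$cluster, w = w)
}

set.seed(1)
fit <- unpenalized_wkm(iris[, -5], centers = 3)
fit$w

# Compare with ordinary k-means on the same scaled data.
plain <- kmeans(scale(iris[, -5]), centers = 3, nstart = 10)
table(weighted = fit$cluster, plain = plain$cluster)
```

Because the weights end up unequal, variables with a larger between-class contribution dominate the distances, which is why the resulting partition need not coincide with the plain k-means one.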

Value

`lambda`	a numerical vector containing the regularization parameters (a grid of values).
`W`	a `p` by `length(lambda)` numerical matrix. It contains the weights associated with each variable.
`Wg`	an `L` by `length(lambda)` numerical matrix, where `L` is the number of groups. It contains the weights associated with each group.
`cluster`	an `n` by `length(lambda)` integer matrix. It contains the cluster memberships for each value of the regularization parameter.
`sel.feat`	a numerical vector of the same length as `lambda`, giving the number of selected variables for each value of the regularization parameter.
`sel.groups`	a numerical vector of the same length as `lambda`, giving the number of selected groups of variables for each value of the regularization parameter.
`Z`	a matrix of size `n` by `p` containing the scaled data if `scaling=TRUE`, and a copy of `X` otherwise.
`bss.per.feature`	a matrix of size `p` by `length(lambda)`. It contains the between-class variance computed for each variable.
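
As a rough illustration of how the counts relate to the weights, the sketch below works on a toy weight matrix rather than on real output. It assumes that a variable or group counts as selected when its weight is non-zero, as the Details section suggests; the matrix `W` and the grouping `index` are made up for the example.

```r
# Toy weight matrix: 4 variables (rows) by 3 values of lambda (columns).
W <- cbind(c(0.5, 0.5, 0.5, 0.5),
           c(0.2, 0.1, 0.6, 0.7),
           c(0.0, 0.0, 0.6, 0.8))
index <- c(1, 1, 2, 2)

# Number of variables with a non-zero weight for each lambda.
n_selected_variables <- colSums(W > 0)

# A group is counted as selected if any of its variables has a non-zero weight.
n_selected_groups <- colSums(rowsum((W > 0) * 1, index) > 0)

n_selected_variables
n_selected_groups
```

With a fitted object, the same computations applied to `out$W` can be compared against `out$sel.feat` and `out$sel.groups`.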

References

Witten, D. M., & Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490), 713-726.

Chavent, M., Lacaille, J., Mourer, A., & Olteanu, M. (2020). Sparse k-means for mixed data via group-sparse clustering. ESANN proceedings.

Examples

data(iris)
# define two groups of variables: 
# "Sepal.Length" and "Sepal.Width" in group 1
# "Petal.Length" and "Petal.Width"  in group 2
index <- c(1, 1, 2, 2)
# group-sparse k-means
out <- groupsparsewkm(X = iris[,-5], centers = 3, index = index)
# grid of regularization parameters
out$lambda
k <- 10
# weights of the variables for the k-th regularization parameter
out$W[,k]
# weights of the groups for the k-th regularization parameter
out$Wg[,k]
# partition obtained for the k-th regularization parameter
out$cluster[,k]
# between-class variance on each variable
out$bss.per.feature[,k]
# between-class variance 
sum(out$bss.per.feature[,k])/length(index)

# one variable per group (equivalent to sparse k-means)
index <- 1:4 # default option in groupsparsewkm
# sparse k-means
out <- groupsparsewkm(X = iris[,-5], centers = 3, index = index)
# or
out <- groupsparsewkm(X = iris[,-5], centers = 3)
# group weights and variable weights are identical in this case
out$Wg 
out$W

[Package vimpclust version 0.1.0 Index]