R: Optimal ("exact") algorithms for anticlustering

optimal_anticlustering {anticlust}

R Documentation

Optimal ("exact") algorithms for anticlustering

Description

Wrapper function that gives access to all optimal algorithms for anticlustering that are available in anticlust.

Usage

optimal_anticlustering(x, K, objective, solver = NULL)

Arguments

`x`	The data input. Can be one of two structures: (1) A feature matrix where rows correspond to elements and columns correspond to variables (a single numeric variable can be passed as a vector). (2) An N x N dissimilarity matrix; can be an object of class `dist` (e.g., returned by `dist` or `as.dist`) or a `matrix` where the entries of the upper and lower triangular matrix represent pairwise dissimilarities.
`K`	How many anticlusters should be created. Alternatively, a vector describing the size of each group (the latter currently only works for `objective = "dispersion"`).
`objective`	The anticlustering objective, can be "diversity", "variance", "kplus" or "dispersion".
`solver`	Optional. The solver used to obtain the optimal solution. Currently supports "glpk" and "symphony". See details.

Details

This is a wrapper for all optimal methods supported in anticlust (currently and in the future). Compared to anticlustering, it allows the user to specify the solver used to obtain an optimal solution, and it can be used to obtain optimal solutions for all supported anticlustering objectives (variance, diversity, k-plus, dispersion). For the objectives "variance", "diversity" and "kplus", the optimal ILP method in Papenberg and Klau (2021) is used, which maximizes the sum of all pairwise intra-cluster distances (given a user-specified number of clusters, for equal-sized clusters). For k-means anticlustering (i.e., objective = "variance"), the squared Euclidean distance is used. For k-plus anticlustering, the squared Euclidean distance based on the extended k-plus data matrix is used (see kplus_moment_variables). For the diversity (and the dispersion), the Euclidean distance is used by default, but any user-defined dissimilarity matrix is possible.
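
The link between the "variance" objective and the diversity on squared Euclidean distances can be checked by brute force on a tiny data set. The following is a minimal base R sketch, not the ILP method used by the package; the helper names `diversity` and `variance` are introduced here for illustration only. It relies on the identity that, for a group of size n, the sum of squared distances to the group centroid equals the sum of pairwise squared distances within the group divided by n.

```r
set.seed(1)
data <- matrix(rnorm(12), ncol = 2)    # 6 elements, 2 features
N <- nrow(data)
d2 <- as.matrix(dist(data))^2          # squared Euclidean distances

# Illustrative helper: sum of pairwise within-group distances
diversity <- function(groups, d) {
  sum(sapply(unique(groups), function(g) {
    idx <- which(groups == g)
    sum(d[idx, idx]) / 2               # the full matrix counts each pair twice
  }))
}

# Illustrative helper: sum of squared distances to the group centroids
variance <- function(groups, x) {
  sum(sapply(unique(groups), function(g) {
    xg <- x[groups == g, , drop = FALSE]
    sum(sweep(xg, 2, colMeans(xg))^2)
  }))
}

# Enumerate all splits into two groups of 3
# (element 1 is fixed in group 1 so that no split is listed twice)
splits <- combn(2:N, 2)
partitions <- apply(splits, 2, function(s) {
  groups <- rep(2, N)
  groups[c(1, s)] <- 1
  groups
})

div <- apply(partitions, 2, diversity, d = d2)
vari <- apply(partitions, 2, variance, x = data)

# With equal group sizes of 3, both objectives differ only by the factor 3,
# so the same partition maximizes both
best <- partitions[, which.max(div)]
all.equal(div, 3 * vari)
```

Full enumeration is only feasible for very small N; it is used here purely to make the equivalence visible.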

The dispersion is solved optimally using the approach described in optimal_dispersion.
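
As a rough illustration of what an optimal dispersion solution means, the sketch below enumerates all partitions of six elements into groups of sizes 2 and 4 in base R. It assumes the dispersion is the minimum distance between any two elements in the same group, which is to be maximized. The `dispersion` helper is hypothetical, and this brute-force search is not the algorithm described in optimal_dispersion.

```r
set.seed(2)
data <- matrix(rnorm(12), ncol = 2)    # 6 elements, 2 features
N <- nrow(data)
d <- as.matrix(dist(data))             # Euclidean distances

# Illustrative helper: smallest within-group pairwise distance
dispersion <- function(groups, d) {
  min(sapply(unique(groups), function(g) {
    idx <- which(groups == g)
    if (length(idx) < 2) return(Inf)   # a singleton group has no pairs
    sub <- d[idx, idx]
    min(sub[upper.tri(sub)])
  }))
}

# Enumerate every choice of 2 elements for group 1; the other 4 form group 2
small <- combn(N, 2)
partitions <- apply(small, 2, function(s) {
  groups <- rep(2, N)
  groups[s] <- 1
  groups
})

disp <- apply(partitions, 2, dispersion, d = d)
best_groups <- partitions[, which.max(disp)]
best_dispersion <- max(disp)
```

Unequal group sizes are used here because, according to the description of `K`, passing a vector of group sizes is currently only supported for the dispersion objective.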

The optimal methods either require the R package Rglpk and the GNU linear programming kit (<http://www.gnu.org/software/glpk/>), or the R package Rsymphony and the COIN-OR SYMPHONY solver libraries (<https://github.com/coin-or/SYMPHONY>). If the argument solver is not specified by the user, the function will try to find the GLPK or SYMPHONY solver and throw an error if neither is available. It will select the GLPK solver if both are available because some rare instances have been observed where the SYMPHONY solver crashes on Mac computers. It may still be worth trying the SYMPHONY solver to see whether the unlikely crash occurs. However, it then has to be requested explicitly via solver = "symphony" (at least if both solver packages Rsymphony and Rglpk are available on the system).
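
The selection order described above can be mimicked before calling the function, for example to report a missing solver early. This is a sketch only: `pick_solver` is a hypothetical helper that is not part of anticlust, and it checks only whether the R packages can be loaded.

```r
# Hypothetical helper (not part of anticlust): prefer GLPK, fall back to SYMPHONY
pick_solver <- function() {
  if (requireNamespace("Rglpk", quietly = TRUE)) return("glpk")
  if (requireNamespace("Rsymphony", quietly = TRUE)) return("symphony")
  NULL
}

solver <- pick_solver()

if (!is.null(solver) && requireNamespace("anticlust", quietly = TRUE)) {
  data <- matrix(rnorm(24), ncol = 2)
  # Pass the solver explicitly instead of relying on the default selection
  groups <- anticlust::optimal_anticlustering(
    data, K = 2, objective = "variance", solver = solver
  )
  print(table(groups))
} else {
  message("anticlust or a solver package (Rglpk, Rsymphony) is missing; skipping.")
}
```

To try SYMPHONY on a system where both packages are installed, replace the `solver` value with `"symphony"` in the call.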

Value

A vector of length N that assigns a group (i.e., a number between 1 and K) to each input element.

Author(s)

Martin Papenberg martin.papenberg@hhu.de

Examples


# data <- matrix(rnorm(24), ncol = 2)

# These calls are equivalent for k-means anticlustering:
# optimal_anticlustering(data, K = 2, objective = "variance")
# optimal_anticlustering(dist(data)^2, K = 2, objective = "diversity")

# These calls are equivalent for k-plus anticlustering:
# optimal_anticlustering(data, K = 2, objective = "kplus")
# optimal_anticlustering(dist(kplus_moment_variables(data, 2))^2, K = 2, objective = "diversity")

[Package anticlust version 0.8.5 Index]