optimal_anticlustering {anticlust} | R Documentation |
Optimal ("exact") algorithms for anticlustering
Description
Wrapper function that gives access to all optimal algorithms for anticlustering that are available in anticlust.
Usage
optimal_anticlustering(x, K, objective, solver = NULL)
Arguments
x |
The data input. Can be one of two structures: (1) A
feature matrix where rows correspond to elements and columns
correspond to variables (a single numeric variable can be
passed as a vector). (2) An N x N dissimilarity matrix;
can be an object of class |
K |
How many anticlusters should be created or alternatively:
(a) A vector describing the size of each group (the latter currently
only works for |
objective |
The anticlustering objective, can be "diversity", "variance", "kplus" or "dispersion". |
solver |
Optional. The solver used to obtain the optimal solution. Currently supports "glpk" and "symphony". See details. |
Details
This is a wrapper for all optimal methods supported in anticlust (currently and in the future).
Compared to anticlustering, it allows the user to specify the solver used to obtain an optimal
solution, and it can be used to obtain optimal solutions for all supported
anticlustering objectives (variance, diversity, k-plus, dispersion). For
the objectives "variance", "diversity" and "kplus", the optimal ILP method
in Papenberg and Klau (2021) is used, which maximizes the sum of all pairwise
intra-cluster distances (for a user-specified number of equal-sized clusters).
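To make the objective concrete, the following is a minimal base R sketch of the quantity being maximized. It is not the ILP method itself: it enumerates every equal-sized split of a tiny data set by brute force, which is only feasible for very small N. The helper name diversity() is hypothetical and not part of anticlust.

```r
# Hypothetical helper: sum of pairwise distances between elements
# that share a cluster (the quantity the text describes).
diversity <- function(d, groups) {
  d <- as.matrix(d)
  same <- outer(groups, groups, "==")   # TRUE where both elements share a group
  sum(d[same & upper.tri(d)])           # count each pair once
}

set.seed(1)
x <- matrix(rnorm(12), ncol = 2)        # N = 6 elements, 2 variables
d <- dist(x)                            # Euclidean distances

# Enumerate all splits into two groups of 3. Element 1 is fixed in
# group 1 so that each partition is visited exactly once.
others <- combn(2:6, 2)                 # 10 candidate partitions
scores <- apply(others, 2, function(idx) {
  groups <- rep(2, 6)
  groups[c(1, idx)] <- 1
  diversity(d, groups)
})

# Assignment with the largest diversity among all candidates
best_groups <- rep(2, 6)
best_groups[c(1, others[, which.max(scores)])] <- 1
best_groups
```

An exact solver reaches the same kind of guarantee (no other equal-sized partition has a larger objective value) without listing every partition explicitly.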
To employ k-means anticlustering (i.e., set objective = "variance"), the
squared Euclidean distance is used. For k-plus anticlustering, the squared Euclidean distance
based on the extended k-plus data matrix is used (see kplus_moment_variables).
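The reason squared Euclidean distances reproduce the k-means objective can be checked numerically. Within a cluster of size n, the sum of pairwise squared Euclidean distances equals n times the sum of squared distances to the cluster centroid, so for equal-sized clusters both objectives rank partitions identically. The sketch below uses only base R; the group assignment is arbitrary and chosen purely for illustration.

```r
set.seed(2)
x <- matrix(rnorm(24), ncol = 2)         # N = 12 elements, 2 variables
groups <- rep(1:2, times = 6)            # arbitrary assignment, 6 per group
n_k <- 6                                 # common cluster size

# Variance (k-means) objective: squared distances to cluster centroids
variance_obj <- sum(sapply(1:2, function(k) {
  xk <- x[groups == k, , drop = FALSE]
  sum(sweep(xk, 2, colMeans(xk))^2)
}))

# Diversity objective computed on squared Euclidean distances
d2 <- as.matrix(dist(x))^2
same <- outer(groups, groups, "==")
diversity_obj <- sum(d2[same & upper.tri(d2)])

# The two quantities should agree up to floating-point error
all.equal(diversity_obj, n_k * variance_obj)
```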
For the diversity (and the dispersion), the Euclidean distance is used by default,
but any user-defined dissimilarity matrix is possible.
The dispersion is solved optimally using the approach described in optimal_dispersion.
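For orientation, the dispersion of a partition is the smallest distance between any two elements that share a cluster. The following base R sketch computes it for two hand-picked assignments; the helper name dispersion() is hypothetical, and the sketch only evaluates the objective rather than optimizing it.

```r
# Hypothetical helper: minimum distance between any two elements
# assigned to the same cluster.
dispersion <- function(d, groups) {
  d <- as.matrix(d)
  same <- outer(groups, groups, "==")
  min(d[same & upper.tri(d)])
}

x <- c(1, 2, 10, 11)                 # a single numeric variable
d <- dist(x)

dispersion(d, c(1, 1, 2, 2))         # similar elements grouped together: 1
dispersion(d, c(1, 2, 1, 2))         # similar elements split apart: 9
```

Maximizing the dispersion therefore favors the second assignment, in which very similar elements end up in different groups.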
The optimal methods require either the R package Rglpk and the GNU linear programming kit
(<http://www.gnu.org/software/glpk/>), or the R package Rsymphony and the COIN-OR SYMPHONY
solver libraries (<https://github.com/coin-or/SYMPHONY>). If the argument solver is not
specified by the user, the function will try to find the GLPK or SYMPHONY
solver and throw an error if neither is available. It will select the
GLPK solver if both are available because some rare instances have been observed where
the SYMPHONY solver crashes on Mac computers. I would still try out the
SYMPHONY solver to see if the unlikely crash occurs. However, this has to be
set by the user via the solver argument (at least if both solver packages Rsymphony and Rglpk
are available on the system).
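The selection rule described above can be sketched in base R as follows. This is an illustration only, not the package's internal code: the function name pick_solver() and its error messages are hypothetical. It relies on requireNamespace() to test whether a solver package is installed.

```r
# Hypothetical helper mirroring the documented rule:
# prefer GLPK, fall back to SYMPHONY, fail if neither is installed.
pick_solver <- function(solver = NULL) {
  available <- c(
    glpk     = requireNamespace("Rglpk", quietly = TRUE),
    symphony = requireNamespace("Rsymphony", quietly = TRUE)
  )
  if (is.null(solver)) {
    if (!any(available)) {
      stop("Neither Rglpk nor Rsymphony is installed.")
    }
    # "glpk" is listed first, so it wins when both are available
    return(names(available)[available][1])
  }
  solver <- match.arg(solver, names(available))
  if (!available[[solver]]) {
    stop("The package for solver '", solver, "' is not installed.")
  }
  solver
}

# Returns "glpk" or "symphony", or the error message if no solver is found
chosen <- tryCatch(pick_solver(), error = function(e) conditionMessage(e))
chosen
```

With both packages installed, passing solver = "symphony" to optimal_anticlustering is the way to override the GLPK default.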
Value
A vector of length N that assigns a group (i.e., a number
between 1 and K) to each input element.
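A returned vector of this form can be inspected with base R tools. In the sketch below, groups is a hand-written stand-in for the function's output (so the code runs without a solver), and check_assignment() is a hypothetical helper, not part of anticlust.

```r
# Hypothetical helper: verify that `groups` has one label in 1..K per
# element and report the group sizes.
check_assignment <- function(groups, N, K) {
  stopifnot(length(groups) == N, all(groups %in% seq_len(K)))
  table(factor(groups, levels = seq_len(K)))
}

set.seed(3)
x <- data.frame(a = rnorm(6), b = rnorm(6))   # N = 6 elements
groups <- c(1, 2, 2, 1, 1, 2)                 # stand-in for the returned vector

sizes <- check_assignment(groups, N = nrow(x), K = 2)
sizes

# Per-group means of each variable, e.g. to compare groups descriptively
means <- aggregate(x, by = list(group = groups), FUN = mean)
means
```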
Author(s)
Martin Papenberg martin.papenberg@hhu.de
Examples
# data <- matrix(rnorm(24), ncol = 2)
# These calls are equivalent for k-means anticlustering:
# optimal_anticlustering(data, K = 2, objective = "variance")
# optimal_anticlustering(dist(data)^2, K = 2, objective = "diversity")
# These calls are equivalent for k-plus anticlustering:
# optimal_anticlustering(data, K = 2, objective = "kplus")
# optimal_anticlustering(dist(kplus_moment_variables(data, 2))^2, K = 2, objective = "diversity")