optimal_dispersion {anticlust} | R Documentation |
Maximize dispersion for K groups
Description
Maximize dispersion for K groups
Usage
optimal_dispersion(x, K, solver = NULL, max_dispersion_considered = NULL)
Arguments
x |
The data input. Can be one of two structures: (1) A
feature matrix where rows correspond to elements and columns
correspond to variables (a single numeric variable can be
passed as a vector). (2) An N x N matrix dissimilarity matrix;
can be an object of class |
K |
The number of groups or a vector describing the size of each group. |
solver |
Optional argument; if passed, has to be either "glpk" or "symphony". See details. |
max_dispersion_considered |
Optional argument used for early stopping. If the dispersion found is equal to or exceeds this value, a solution having the previous best dispersion is returned. |
Details
The dispersion is the minimum distance between two elements
within the same group. This function implements an optimal method
to maximize the dispersion. If the data input x
is a feature
matrix and not a dissimilarity matrix, the pairwise Euclidean
distance is used. It uses the algorithm presented in Max
Diekhoff's Bachelor thesis at the Computer Science Department at
the Heinrich Heine University Düsseldorf.
To find out which items are not allowed to be grouped in the same
cluster for maximum dispersion, the algorithm sequentially builds
instances of a graph coloring problem, using an integer linear
programming (ILP) representation (also see Fernandez et al.,
2013). It is possible to specify the ILP solver via the argument
solver
. This function either requires the R package
Rglpk
and the GNU linear programming kit
(<http://www.gnu.org/software/glpk/>) or the R package
Rsymphony
and the COIN-OR SYMPHONY solver libraries
(<https://github.com/coin-or/SYMPHONY>). If the argument
solver
is not specified, the function will try to find the
GLPK or SYMPHONY solver by itself. It prioritizes using GLPK if
both are available. However, the GNU linear programming kit (solver =
"glpk"
) seems to be considerably slower for K >= 3 than the
SYMPHPONY solver (solver = "symphony"
). So, it is recommended to manually
change the default behaviour.
Optimally solving the maximum dispersion problem is NP-hard for K
> 2 and therefore computationally infeasible for larger data
sets. For K = 3 and K = 4, it seems that this approach scales up to several 100 elements,
or even > 1000 for K = 3 (at least when using the Symphony solver).
For larger data sets, use the heuristic approaches in anticlustering
or
bicriterion_anticlustering
. However, note that for K = 2,
the optimal approach is usually much faster than the heuristics.
In the output, the element edges
defines which elements must be in separate
clusters in order to achieve maximum dispersion. All elements not listed here
can be changed arbitrarily between clusters without reducing the dispersion.
If the maximum possible dispersion corresponds to the minimum dispersion
in the data set, the output elements edges
and groups
are set to
NULL
because all possible groupings have the same value of dispersion.
In this case the output element dispersions_considered
has length 1.
Value
A list with four elements:
dispersion |
The optimal dispersion |
groups |
An assignment of elements to groups (vector) |
edges |
A matrix of 2 columns. Each row contains the indices of elements that had to be investigated to find the dispersion (i.e., each pair of elements cannot be part of the same group in order to achieve maximum dispersion). |
dispersions_considered |
All distances that were tested until the dispersion was found. |
Note
If the SYMPHONY solver is used, an unfortunate "message" is printed to the console when this function terminates:
sym_get_col_solution(): No solution has been stored!
This message is no reason to worry and instead is a direct result
of the algorithm finding the optimal value for the dispersion.
Unfortunately, this message is generated in the C code underlying the
SYMPHONY library (via the printing function printf
), which cannot be
prevented in R.
Author(s)
Max Diekhoff
Martin Papenberg martin.papenberg@hhu.de
References
Diekhoff (2023). Maximizing dispersion for anticlustering. Retrieved from https://www.cs.hhu.de/fileadmin/redaktion/Fakultaeten/Mathematisch-Naturwissenschaftliche_Fakultaet/Informatik/Algorithmische_Bioinformatik/Bachelor-_Masterarbeiten/2831963_ba_ifo_AbschlArbeit_klau_mapap102_madie120_20230203_1815.pdf
Fernández, E., Kalcsics, J., & Nickel, S. (2013). The maximum dispersion problem. Omega, 41(4), 721–730. https://doi.org/10.1016/j.omega.2012.09.005
See Also
dispersion_objective
anticlustering
Examples
N <- 30
M <- 5
K <- 3
data <- matrix(rnorm(N*M), ncol = M)
distances <- dist(data)
opt <- optimal_dispersion(distances, K = K)
opt
# Compare to bicriterion heuristic:
groups_heuristic <- anticlustering(
distances,
K = K,
method = "brusco",
objective = "dispersion",
repetitions = 100
)
c(
OPT = dispersion_objective(distances, opt$groups),
HEURISTIC = dispersion_objective(distances, groups_heuristic)
)
# Different group sizes are possible:
table(optimal_dispersion(distances, K = c(15, 10, 5))$groups)
# Induce cannot-link constraints by maximizing the dispersion:
solvable <- matrix(1, ncol = 6, nrow = 6)
solvable[2, 1] <- -1
solvable[3, 1] <- -1
solvable[4, 1] <- -1
solvable <- as.dist(solvable)
solvable
# An optimal solution has to put item 1 in a different group than
# items 2, 3 and 4 -> this is possible for K = 2
optimal_dispersion(solvable, K = 2)$groups
# It no longer works when item 1 can also not be linked with item 5:
# (check out output!)
unsolvable <- as.matrix(solvable)
unsolvable[5, 1] <- -1
unsolvable <- as.dist(unsolvable)
unsolvable
optimal_dispersion(unsolvable, K = 2)
# But:
optimal_dispersion(unsolvable, K = c(2, 4)) # group sizes, not number of groups