optimal_dispersion {anticlust}R Documentation

Maximize dispersion for K groups

Description

Maximize dispersion for K groups

Usage

optimal_dispersion(x, K, solver = NULL, max_dispersion_considered = NULL)

Arguments

x

The data input. Can be one of two structures: (1) A feature matrix where rows correspond to elements and columns correspond to variables (a single numeric variable can be passed as a vector). (2) An N x N matrix dissimilarity matrix; can be an object of class dist (e.g., returned by dist or as.dist) or a matrix where the entries of the upper and lower triangular matrix represent pairwise dissimilarities.

K

The number of groups or a vector describing the size of each group.

solver

Optional argument; if passed, has to be either "glpk" or "symphony". See details.

max_dispersion_considered

Optional argument used for early stopping. If the dispersion found is equal to or exceeds this value, a solution having the previous best dispersion is returned.

Details

The dispersion is the minimum distance between two elements within the same group. This function implements an optimal method to maximize the dispersion. If the data input x is a feature matrix and not a dissimilarity matrix, the pairwise Euclidean distance is used. It uses the algorithm presented in Max Diekhoff's Bachelor thesis at the Computer Science Department at the Heinrich Heine University Düsseldorf.

To find out which items are not allowed to be grouped in the same cluster for maximum dispersion, the algorithm sequentially builds instances of a graph coloring problem, using an integer linear programming (ILP) representation (also see Fernandez et al., 2013). It is possible to specify the ILP solver via the argument solver. This function either requires the R package Rglpk and the GNU linear programming kit (<http://www.gnu.org/software/glpk/>) or the R package Rsymphony and the COIN-OR SYMPHONY solver libraries (<https://github.com/coin-or/SYMPHONY>). If the argument solver is not specified, the function will try to find the GLPK or SYMPHONY solver by itself. It prioritizes using GLPK if both are available. However, the GNU linear programming kit (solver = "glpk") seems to be considerably slower for K >= 3 than the SYMPHPONY solver (solver = "symphony"). So, it is recommended to manually change the default behaviour.

Optimally solving the maximum dispersion problem is NP-hard for K > 2 and therefore computationally infeasible for larger data sets. For K = 3 and K = 4, it seems that this approach scales up to several 100 elements, or even > 1000 for K = 3 (at least when using the Symphony solver). For larger data sets, use the heuristic approaches in anticlustering or bicriterion_anticlustering. However, note that for K = 2, the optimal approach is usually much faster than the heuristics.

In the output, the element edges defines which elements must be in separate clusters in order to achieve maximum dispersion. All elements not listed here can be changed arbitrarily between clusters without reducing the dispersion. If the maximum possible dispersion corresponds to the minimum dispersion in the data set, the output elements edges and groups are set to NULL because all possible groupings have the same value of dispersion. In this case the output element dispersions_considered has length 1.

Value

A list with four elements:

dispersion

The optimal dispersion

groups

An assignment of elements to groups (vector)

edges

A matrix of 2 columns. Each row contains the indices of elements that had to be investigated to find the dispersion (i.e., each pair of elements cannot be part of the same group in order to achieve maximum dispersion).

dispersions_considered

All distances that were tested until the dispersion was found.

Note

If the SYMPHONY solver is used, an unfortunate "message" is printed to the console when this function terminates:

sym_get_col_solution(): No solution has been stored!

This message is no reason to worry and instead is a direct result of the algorithm finding the optimal value for the dispersion. Unfortunately, this message is generated in the C code underlying the SYMPHONY library (via the printing function printf), which cannot be prevented in R.

Author(s)

Max Diekhoff

Martin Papenberg martin.papenberg@hhu.de

References

Diekhoff (2023). Maximizing dispersion for anticlustering. Retrieved from https://www.cs.hhu.de/fileadmin/redaktion/Fakultaeten/Mathematisch-Naturwissenschaftliche_Fakultaet/Informatik/Algorithmische_Bioinformatik/Bachelor-_Masterarbeiten/2831963_ba_ifo_AbschlArbeit_klau_mapap102_madie120_20230203_1815.pdf

Fernández, E., Kalcsics, J., & Nickel, S. (2013). The maximum dispersion problem. Omega, 41(4), 721–730. https://doi.org/10.1016/j.omega.2012.09.005

See Also

dispersion_objective anticlustering

Examples


N <- 30
M <- 5
K <- 3
data <- matrix(rnorm(N*M), ncol = M)
distances <- dist(data)

opt <- optimal_dispersion(distances, K = K)
opt

# Compare to bicriterion heuristic:
groups_heuristic <- anticlustering(
  distances, 
  K = K,
  method = "brusco", 
  objective = "dispersion", 
  repetitions = 100
)
c(
  OPT = dispersion_objective(distances, opt$groups),
  HEURISTIC = dispersion_objective(distances, groups_heuristic)
)

# Different group sizes are possible:
table(optimal_dispersion(distances, K = c(15, 10, 5))$groups)

# Induce cannot-link constraints by maximizing the dispersion:
solvable <- matrix(1, ncol = 6, nrow = 6)
solvable[2, 1] <- -1
solvable[3, 1] <- -1
solvable[4, 1] <- -1
solvable <- as.dist(solvable)
solvable

# An optimal solution has to put item 1 in a different group than 
# items 2, 3 and 4 -> this is possible for K = 2
optimal_dispersion(solvable, K = 2)$groups

# It no longer works when item 1 can also not be linked with item 5:
# (check out output!)
unsolvable <- as.matrix(solvable)
unsolvable[5, 1] <- -1
unsolvable <- as.dist(unsolvable)
unsolvable
optimal_dispersion(unsolvable, K = 2)
# But:
optimal_dispersion(unsolvable, K = c(2, 4)) # group sizes, not number of groups


[Package anticlust version 0.8.5 Index]