anticlustering {anticlust}R Documentation

Anticlustering

Description

Partition a pool of elements into groups (i.e., anticlusters) with the aim of creating high within-group heterogeneity and high between-group similarity. Anticlustering is accomplished by maximizing instead of minimizing a clustering objective function. Implements anticlustering algorithms as described in Papenberg and Klau (2021; <doi:10.1037/met0000301>) and an implementation of the bicriterion anticlustering algorithm described by Brusco, Cradit and Steinley (2020; <doi:10.1111/bmsp.12186>).

Usage

anticlustering(
  x,
  K,
  objective = "diversity",
  method = "exchange",
  preclustering = FALSE,
  categories = NULL,
  repetitions = NULL,
  standardize = FALSE
)

Arguments

x

The data input. Can be one of two structures: (1) A feature matrix where rows correspond to elements and columns correspond to variables (a single numeric variable can be passed as a vector). (2) An N x N matrix dissimilarity matrix; can be an object of class dist (e.g., returned by dist or as.dist) or a matrix where the entries of the upper and lower triangular matrix represent pairwise dissimilarities.

K

How many anticlusters should be created. Alternatively: (a) A vector describing the size of each group, or (b) a vector of length nrow(x) describing how elements are assigned to anticlusters before the optimization starts.

objective

The objective to be maximized. The options "diversity" (default; previously called "distance", which is still supported), "variance", "kplus" and "dispersion" are natively supported. May also be a user-defined function. See Details.

method

One of "exchange" (default) , "local-maximum", "brusco", or "ilp". See Details.

preclustering

Boolean. Should a preclustering be conducted before anticlusters are created? Defaults to FALSE. See Details.

categories

A vector, data.frame or matrix representing one or several categorical constraints. See Details.

repetitions

The number of times a search heuristic is initiated when using method = "exchange", method = "local-maximum", or method = "brusco". In the end, the best objective found across the repetitions is returned.

standardize

Boolean. If TRUE and x is a feature matrix, the data is standardized through a call to scale before the optimization starts. This argument is silently ignored if x is a distance matrix.

Details

This function is used to solve anticlustering. That is, the data input is divided into K in such a way that elements within groups are heterogeneous and the different groups are similar. Anticlustering is accomplished by maximizing instead of minimizing a clustering objective function. The maximization of four clustering objective functions is natively supported (other functions can also defined by the user as described below):

The k-means objective is the within-group variance—that is, the sum of the squared distances between each element and its cluster center (see variance_objective). K-means anticlustering focuses on minimizing differences with regard to the means of the input variables x; k-plus objective = "kplus" anticlustering is an extension of this criterion that also tries to minimize differences with regard to the standard deviations between groups (see kplus_objective).

The cluster editing "diversity" objective is the sum of pairwise distances of elements within the same groups (see diversity_objective). Anticluster editing is also known as the »maximum diverse grouping problem« because it maximizes group diversity as measured by the sum of pairwise distances. Hence, anticlustering maximizes between-group similarity by maximizing within-group heterogeneity. In previous versions of this package, method = "distance" was used (and is still supported) to request anticluster editing, but now method = "diversity" is preferred because there are several clustering objectives based on pairwise distances (e.g., see dispersion_objective).

The "dispersion" is the minimum distance between any two elements that are part of the same cluster; maximization of this objective ensures that any two elements within the same group are as dissimilar from each other as possible. Applications that require high within-group heterogeneity often require to maximize the dispersion.

If the data input x is a feature matrix (that is: each row is a "case" and each column is a "variable") and the option objective = "diversity" is used, the Euclidean distance is computed as the basic unit of the anticluster editing objective. If a different measure of dissimilarity is preferred, you may pass a self-generated dissimiliarity matrix via the argument x.

In the standard case, groups of equal size are generated. Adjust the argument K to create groups of different size (see examples).

Algorithms for anticlustering

By default, a heuristic method is employed for anticlustering: the exchange method (method = "exchange"). First, anticlusters are randomly assigned to clusters. (It is also possible to explicitly specify the initial assignment using the argument K; in this case, K has length nrow(x).) Based on the initial assignment, elements are systematically swapped between anticlusters in such a way that each swap improves the objective value. For an element, each possible swap with elements in other clusters is simulated; then, the one swap is performed that improves the objective the most, but a swap is only conducted if there is an improvement at all. This swapping procedure is repeated for each element. When using method = "local-maximum", the exchange method does not terminate after the first iteration over all elements; instead, the swapping continues until a local maximum is reached. This means that after the exchange process has been conducted once for each data point, the algorithm restarts with the first element and proceeds to conduct exchanges until the objective cannot be improved.

When setting preclustering = TRUE, only the K - 1 most similar elements serve as exchange partners for each element, which can dramatically speed up the optimization (more information on the preclustering heuristic follows below). This option is recommended for larger N.

Using method = "brusco" implements the local bicriterion iterated local search heuristic by Brusco et al. (2020), but only optimizes for one criterion - diversity or dispersion. bicriterion_anticlustering) can be used to conduct the actual bicriterion optimization that is offered by Brusco et al.'s algorithm.

Optimal anticlustering

An optimal anticluster editing objective (i.e., a solution having the global maximum value in diversity) can be found via integer linear programming. The integer linear program implemented here can be found in Papenberg & Klau, (2021; (8) - (12)). To this end, set method = "ilp". To obtain an optimal solution, the open source GNU linear programming kit (available from https://www.gnu.org/software/glpk/glpk.html) and the R package Rglpk must be installed. The optimal solution is retrieved by setting objective = "diversity", method = "ilp" and preclustering = FALSE. Use this combination of arguments only for small problem sizes.

To relax the optimality requirement, it is possible to set the argument preclustering = TRUE. In this case, the anticluster editing objective is still optimized using integer linear programming, but a preprocessing forbids very similar elements to be assigned to the same anticluster. The preclustering reduces the size of the solution space, making the integer linear programming approach applicable for larger problem instances. With preclustering, optimality is no longer guaranteed, but the solution is usually optimal or very close to optimal.

The variance and dispersion criterion cannot be optimized to optimality using integer linear programming because the objective functions are not linear. However, it is possible to employ the function generate_partitions to obtain optimal solutions for small problem instances.

Preclustering

A useful heuristic for anticlustering is to form small groups of very similar elements and assign these to different groups. This logic is used as a preprocessing when setting preclustering = TRUE. That is, before the anticlustering objective is optimized, a cluster analysis identifies small groups of similar elements (pairs if K = 2, triplets if K = 3, and so forth). The optimization of the anticlustering objective is then conducted under the constraint that these matched elements cannot be assigned to the same group. When using the exchange algorithm, preclustering is conducted using a call to matching. When using method = "ilp", the preclustering optimally finds groups of minimum pairwise distance by solving the integer linear program described in Papenberg and Klau (2021; (8) - (10), (12) - (13)).

Categorical constraints

The argument categories may induce categorical constraints. The grouping variables indicated by categories will be balanced out across anticlusters. This functionality is only available for the classical exchange procedures, that is, for method = "exchange" and method = "local-maximum".

Optimize a custom objective function

It is possible to pass a function to the argument objective. See dispersion_objective for an example. If objective is a function, the exchange method assigns elements to anticlusters in such a way that the return value of the custom function is maximized (hence, the function should return larger values when the between-group similarity is higher). The custom function has to take two arguments: the first is the data argument, the second is the clustering assignment. That is, the argument x will be passed down to the user-defined function as first argument. However, only after as.matrix has been called on x. This implies that in the function body, columns of the data set cannot be accessed using data.frame operations such as $. Objects of class dist will be converted to matrix as well.

Value

A vector of length N that assigns a group (i.e, a number between 1 and K) to each input element.

Author(s)

Martin Papenberg martin.papenberg@hhu.de

References

Brusco, M. J., Cradit, J. D., & Steinley, D. (2020). Combining diversity and dispersion criteria for anticlustering: A bicriterion approach. British Journal of Mathematical and Statistical Psychology, 73, 275-396. https://doi.org/10.1111/bmsp.12186

Grötschel, M., & Wakabayashi, Y. (1989). A cutting plane algorithm for a clustering problem. Mathematical Programming, 45, 59-96.

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301.

Späth, H. (1986). Anticlustering: Maximizing the variance criterion. Control and Cybernetics, 15, 213-218.

Examples


# Optimize the cluster editing (diversity) criterion
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  K = 3,
  categories = schaper2019$room
)
# Compare feature means by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(colMeans(x), 2))
# Compare standard deviations by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(apply(x, 2, sd), 2))
# check that the "room" is balanced across anticlusters:
table(anticlusters, schaper2019$room)

# Use multiple starts of the algorithm to improve the objective and
# optimize the k-means criterion ("variance")
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  objective = "variance",
  K = 3,
  categories = schaper2019$room,
  method = "local-maximum",
  repetitions = 2
)
# Compare means and standard deviations by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(colMeans(x), 2))
by(schaper2019[, 3:6], anticlusters, function(x) round(apply(x, 2, sd), 2))

# Use different group sizes and optimize the extended k-means
# criterion ("kplus")
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  objective = "kplus",
  K = c(24, 24, 48),
  categories = schaper2019$room,
  repetitions = 10,
  method = "local-maximum",
  standardize = TRUE
)

table(anticlusters, schaper2019$room)
# Compare means and standard deviations by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(colMeans(x), 2))
by(schaper2019[, 3:6], anticlusters, function(x) round(apply(x, 2, sd), 2))


## Use preclustering and variance (k-means) criterion on large data set
N <- 1000
K = 2
lds <- data.frame(f1 = rnorm(N), f2 = rnorm(N))
ac <- anticlustering(
  lds, 
  K = K,
  objective = "variance",
  preclustering = TRUE
)

# The following is equivalent to setting `preclustering = TRUE`:
cl <- balanced_clustering(lds, K = N / K)
ac <- anticlustering(
  lds, 
  K = K,
  objective = "variance",
  categories = cl
)


[Package anticlust version 0.6.0 Index]