anticlustering {anticlust} | R Documentation

Create groups of elements (anticlusters) that are as similar as possible to each other, by maximizing the heterogeneity within groups. Implements anticlustering algorithms as described in Papenberg and Klau (2020; <doi:10.1037/met0000301>).

```r
anticlustering(
  x,
  K,
  objective = "diversity",
  method = "exchange",
  preclustering = FALSE,
  categories = NULL,
  repetitions = NULL,
  standardize = FALSE
)
```

`x` |
The data input. Can be one of two structures: (1) A
feature matrix where rows correspond to elements and columns
correspond to variables (a single numeric variable can be
passed as a vector). (2) An N x N dissimilarity matrix;
can be an object of class `dist`. |

`K` |
How many anticlusters should be created. Alternatively: (a)
A vector describing the size of each group, or (b) a
vector of length `nrow(x)` describing the initial assignment
of elements to anticlusters (see Details). |

`objective` |
The objective to be maximized. The options "diversity" (default; previously called "distance", which is still supported), "variance", "kplus" and "dispersion" are natively supported. May also be a user-defined function object that computes an objective value given a clustering. See Details. |

`method` |
One of "exchange" (default), "local-maximum", or "ilp". See Details. |

`preclustering` |
Boolean. Should a preclustering be conducted
before anticlusters are created? Defaults to `FALSE`. See Details. |

`categories` |
A vector, data.frame or matrix representing one or several categorical constraints. See Details. |

`repetitions` |
The number of times a new exchange procedure is
initiated when `method = "exchange"` or `method = "local-maximum"`.
In the end, the best objective found across the repetitions is returned. |

`standardize` |
Boolean. If `TRUE` and `x` is a feature matrix, the variables are
standardized through a call to `scale` before the optimization starts.
This argument is ignored if `x` is a dissimilarity matrix. |

This function is used to solve anticlustering. That is, `K` groups are
created in such a way that all groups are as similar as possible (this
usually corresponds to creating groups with high within-group
heterogeneity). This is accomplished by maximizing instead of minimizing
a clustering objective function. The maximization of four clustering
objective functions is natively supported (other functions can also be
defined by the user as described below):

- the cluster editing "diversity" objective, setting `objective = "diversity"` (default)
- the k-means "variance" objective, setting `objective = "variance"`
- the k-plus objective, an extension of the k-means variance criterion, setting `objective = "kplus"`
- the "dispersion" objective, the minimum distance between any two elements within the same cluster, setting `objective = "dispersion"`

The k-means objective is the within-group variance, that is, the sum of
the squared distances between each element and its cluster center (see
`variance_objective`). K-means anticlustering focuses on minimizing
differences with regard to the means of the input variables `x`; k-plus
anticlustering (`objective = "kplus"`) is an extension of this criterion
that also tries to minimize differences with regard to the standard
deviations between groups (see `kplus_objective`).

The cluster editing "diversity" objective is the sum of pairwise
distances within groups (see `diversity_objective`). Anticluster
editing is also known as the "maximum diverse grouping problem" because
it maximizes group diversity as measured by the sum of pairwise
distances. Hence, anticlustering maximizes between-group similarity by
maximizing within-group heterogeneity. In previous versions of this
package, `objective = "distance"` was used (and is still supported) to
request anticluster editing, but now `objective = "diversity"` is
preferred because there are several clustering objectives based on
pairwise distances (e.g., see `dispersion_objective`).

The "dispersion" objective is the minimum distance between any two elements within the same cluster; applications that require high within-group heterogeneity often call for maximizing the dispersion.
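As an illustration of how these two pairwise-distance objectives differ, both can be computed by hand from a distance matrix. The following is a minimal base-R sketch on made-up random data; the package's own implementations are `diversity_objective` and `dispersion_objective`:

```r
# Hand-compute the diversity (sum of within-group pairwise distances)
# and the dispersion (minimum within-group pairwise distance) for a
# fixed assignment; made-up data, not the package's internal code.
set.seed(1)
d <- as.matrix(dist(matrix(rnorm(20), ncol = 2)))  # 10 elements, 2 features
clusters <- rep(1:2, each = 5)                     # a fixed assignment
within <- outer(clusters, clusters, "==") & upper.tri(d)
diversity  <- sum(d[within])   # maximized by objective = "diversity"
dispersion <- min(d[within])   # maximized by objective = "dispersion"
```

Anticlustering seeks assignments that make such values large, whereas clustering would make them small.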

If the data input `x` is a feature matrix (that is: each row is a
"case" and each column is a "variable") and the option
`objective = "diversity"` is used, the Euclidean distance is computed
as the basic unit of the anticluster editing objective. If a different
measure of dissimilarity is preferred, you may pass a self-generated
dissimilarity matrix via the argument `x`.

In the standard case, groups of equal size are generated. Adjust the
argument `K` to create groups of different size (see Examples).

**Heuristic anticlustering**

By default, a heuristic method is employed for anticlustering: the
exchange method (`method = "exchange"`). Building on an initial
assignment of elements to anticlusters, elements are sequentially
swapped between anticlusters in such a way that each swap improves set
similarity by the largest amount possible. In the default case,
elements are randomly assigned to anticlusters before the exchange
procedure starts; however, it is also possible to explicitly specify
the initial assignment using the argument `K` (in this case, `K` has
length `nrow(x)`). The exchange procedure is repeated for each element.
Because each possible swap is investigated for each element, the total
number of exchanges grows quadratically with input size, rendering the
exchange method unsuitable for large N. When using
`method = "local-maximum"`, the exchange method is repeated until a
local maximum is reached. That means that after the exchange process
has been conducted once for each data point, the algorithm restarts
with the first element and continues to conduct exchanges until the
objective can no longer be improved.
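For instance, a deterministic start can be requested by passing an assignment vector of length `nrow(x)` as `K`. This is a minimal sketch assuming the anticlust package is installed; the data are made up:

```r
library(anticlust)

# Made-up feature data; an explicit initial assignment passed via `K`
# (length nrow(x)) replaces the default random initialization.
data <- matrix(rnorm(40), ncol = 2)     # 20 elements, 2 features
init <- rep_len(1:2, nrow(data))        # alternating initial groups
groups <- anticlustering(data, K = init, objective = "diversity")
table(groups)                           # group sizes of `init` are kept
```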

When setting `preclustering = TRUE`, only the `K - 1` most similar
elements serve as exchange partners, which can dramatically speed up
the optimization (more information on the preclustering option is
included below). This option is recommended for larger N. For very
large N, check out the function `fast_anticlustering` that was
specifically implemented to process very large data sets.

**Exact anticlustering**

An optimal anticluster editing objective can be found via integer
linear programming (the integer linear program implemented here can be
found in Papenberg & Klau, 2020, (8) - (12)). To this end, set
`method = "ilp"`. To obtain an optimal solution, the open source GNU
Linear Programming Kit (available from
https://www.gnu.org/software/glpk/glpk.html) and the R package `Rglpk`
must be installed. The optimal solution is retrieved by setting
`objective = "diversity"`, `method = "ilp"` and
`preclustering = FALSE`. Use this combination of arguments only for
small problem sizes.
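A call requesting the optimal solution might look as follows. This is a sketch assuming anticlust and Rglpk are installed; `schaper2019` is the package's example data set, subset here to keep the problem small:

```r
library(anticlust)  # requires the Rglpk package and GLPK for method = "ilp"

# Exact anticluster editing on a deliberately tiny subset
optimal <- anticlustering(
  schaper2019[1:12, 3:6],
  K = 2,
  objective = "diversity",
  method = "ilp",
  preclustering = FALSE
)
table(optimal)  # two groups of six
```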

To relax the optimality requirement, it is possible to set the argument
`preclustering = TRUE`. In this case, the anticluster editing objective
is still optimized using integer linear programming, but a
preprocessing step forbids very similar elements from being assigned to
the same anticluster. The preclustering reduces the size of the
solution space, making the integer linear programming approach
applicable to larger problem instances. With preclustering, optimality
is no longer guaranteed, but the solution is usually optimal or very
close to optimal.

The variance criterion cannot be optimized to optimality using integer
linear programming because the k-means objective function is not
linear. However, it is possible to employ the function
`generate_partitions` to obtain optimal solutions for small problem
instances.

**Preclustering**

A useful heuristic for anticlustering is to form small groups of very
similar elements and assign these to different groups. This logic is
used as a preprocessing step when setting `preclustering = TRUE`. That
is, before the anticlustering objective is optimized, a cluster
analysis identifies small groups of similar elements (pairs if
`K = 2`, triplets if `K = 3`, and so forth). The optimization of the
anticlustering objective is then conducted under the constraint that
these matched elements cannot be assigned to the same group. When using
the exchange algorithm, preclustering is conducted using a call to
`matching`. When using `method = "ilp"`, the preclustering optimally
finds groups of minimum pairwise distance by solving the integer linear
program described in Papenberg and Klau (2020; (8) - (10), (12) - (13)).

**Categorical constraints**

The argument `categories` may induce categorical constraints. The
grouping variables indicated by `categories` will be balanced out
across anticlusters. Currently, this functionality is only available in
combination with the heuristic methods, but not with the exact integer
linear programming approach.

**Optimize a custom objective function**

It is possible to pass a `function` to the argument `objective`. See
`dispersion_objective` for an example. If `objective` is a function,
the exchange method assigns elements to anticlusters in such a way that
the return value of the custom function is maximized (hence, the
function should return larger values when the between-group similarity
is higher). The custom function has to take two arguments: the first is
the data argument, the second is the clustering assignment. That is,
the argument `x` will be passed down to the user-defined function as
the first argument, but only after `as.matrix` has been called on it.
This implies that in the function body, columns of the data set cannot
be accessed using `data.frame` operations such as `$`. Objects of class
`dist` will be converted to matrix as well.
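As a sketch, the following hypothetical objective (not part of anticlust) sums the within-group ranges of each variable, so larger values imply more within-group heterogeneity. Note that it indexes `x` as a matrix rather than via `$`:

```r
library(anticlust)

# A hypothetical user-defined objective: the sum of within-group
# ranges across all variables. Because `x` arrives as a matrix,
# matrix indexing is used instead of data.frame operations.
range_objective <- function(x, clusters) {
  sum(sapply(unique(clusters), function(k) {
    sub <- x[clusters == k, , drop = FALSE]
    sum(apply(sub, 2, function(col) diff(range(col))))
  }))
}

groups <- anticlustering(
  matrix(rnorm(60), ncol = 3),  # made-up data: 20 elements, 3 variables
  K = 2,
  objective = range_objective
)
```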

**Value**

A vector of length N that assigns a group (i.e., a number between 1
and `K`) to each input element.

**Author**

Martin Papenberg martin.papenberg@hhu.de

**References**

Grötschel, M., & Wakabayashi, Y. (1989). A cutting plane algorithm for a clustering problem. Mathematical Programming, 45, 59-96.

Papenberg, M., & Klau, G. W. (2020). Using anticlustering to partition data sets into equivalent parts. Psychological Methods. Advance Online Publication. https://doi.org/10.1037/met0000301.

Späth, H. (1986). Anticlustering: Maximizing the variance criterion. Control and Cybernetics, 15, 213-218.

**Examples**

```r
# Optimize the cluster editing (diversity) criterion
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  K = 3,
  categories = schaper2019$room
)

# Compare feature means by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(colMeans(x), 2))
# Compare standard deviations by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(apply(x, 2, sd), 2))
# Check that the "room" is balanced across anticlusters:
table(anticlusters, schaper2019$room)

# Use multiple starts of the algorithm to improve the objective and
# optimize the k-means criterion ("variance")
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  objective = "variance",
  K = 3,
  categories = schaper2019$room,
  method = "local-maximum",
  repetitions = 2
)
# Compare means and standard deviations by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(colMeans(x), 2))
by(schaper2019[, 3:6], anticlusters, function(x) round(apply(x, 2, sd), 2))

# Use different group sizes and optimize the extended k-means
# criterion ("kplus")
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  objective = "kplus",
  K = c(24, 24, 48),
  categories = schaper2019$room,
  repetitions = 10,
  method = "local-maximum",
  standardize = TRUE
)
table(anticlusters, schaper2019$room)
# Compare means and standard deviations by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(colMeans(x), 2))
by(schaper2019[, 3:6], anticlusters, function(x) round(apply(x, 2, sd), 2))

## Use preclustering and variance (k-means) criterion on large data set
N <- 1000
K <- 2
lds <- data.frame(f1 = rnorm(N), f2 = rnorm(N))
ac <- anticlustering(
  lds,
  K = K,
  objective = "variance",
  preclustering = TRUE
)

# The following is equivalent to setting `preclustering = TRUE`:
cl <- balanced_clustering(lds, K = N / K)
ac <- anticlustering(
  lds,
  K = K,
  objective = "variance",
  categories = cl
)
```

[Package *anticlust* version 0.5.6 Index]