fast_anticlustering {anticlust}    R Documentation

Fast anticlustering


The most efficient way to solve the anticlustering problem when optimizing the k-means variance criterion with an exchange method. Can be used for very large data sets.


fast_anticlustering(x, K, k_neighbours = Inf, categories = NULL)



x: A numeric vector, matrix or data.frame of data points. Rows correspond to elements and columns correspond to features. A vector represents a single numeric feature.


K: How many anticlusters should be created.


k_neighbours: The number of neighbours that serve as exchange partners for each element. Defaults to Inf, i.e., each element is exchanged with each element in other groups.


categories: A vector, data.frame or matrix representing one or several categorical constraints.


This function was created to make anticlustering applicable to large data sets (e.g., 100,000 elements). It optimizes the k-means variance objective because this objective does not require all pairwise distances, which are not feasible to compute for many elements. Additionally, this function employs a speed-optimized exchange method. For each element, the potential exchange partners are generated using a nearest neighbour search with the function nn2 from the RANN package. The nearest neighbours then serve as exchange partners. This approach is inspired by the preclustering heuristic, according to which good solutions are found when similar elements are in different sets; swapping nearest neighbours will often produce this.

The number of exchange partners per element is set using the argument k_neighbours. By default, it is set to Inf, meaning that all possible swaps are tested. For large data sets, the user must change this default to a finite value. More exchange partners generally improve the output, but also increase run time.
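The exchange idea described above can be illustrated with a small base R sketch. This is not the package's implementation: variance_criterion(), nearest_partners() and exchange_pass() are hypothetical helpers written for this illustration, the neighbour search uses a full distance matrix instead of RANN::nn2 (so it only suits small inputs), and the pass accepts the first improving swap rather than searching for the best one. It assumes that anticlustering seeks to maximize the variance criterion.

```r
# Illustration only: hypothetical helpers, not functions from anticlust.

# k-means variance criterion: sum of squared Euclidean distances between
# each element and the centroid of its group.
variance_criterion <- function(x, groups) {
  total <- 0
  for (g in unique(groups)) {
    members <- x[groups == g, , drop = FALSE]
    centroid <- colMeans(members)
    total <- total + sum(sweep(members, 2, centroid)^2)
  }
  total
}

# For each row, the indices of its k nearest other rows (requires k >= 2).
# Assumption: a full distance matrix stands in for RANN::nn2.
nearest_partners <- function(x, k) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf  # an element is not its own exchange partner
  t(apply(d, 1, function(row) order(row)[seq_len(k)]))
}

# One simplified exchange pass: swap an element with a neighbour from a
# different group whenever the swap increases the criterion.
exchange_pass <- function(x, groups, partners) {
  best <- variance_criterion(x, groups)
  for (i in seq_len(nrow(x))) {
    for (j in partners[i, ]) {
      if (groups[i] == groups[j]) next
      candidate <- groups
      candidate[c(i, j)] <- groups[c(j, i)]  # swap the two group labels
      value <- variance_criterion(x, candidate)
      if (value > best) {
        groups <- candidate
        best <- value
      }
    }
  }
  groups
}

features <- as.matrix(iris[, -5])
start_groups <- rep_len(1:3, nrow(features))  # balanced starting assignment
partners <- nearest_partners(features, k = 5)
improved <- exchange_pass(features, start_groups, partners)

variance_criterion(features, start_groups)
variance_criterion(features, improved)
```

Because swaps only exchange group labels between two elements, group sizes stay unchanged, and the criterion after the pass can never be lower than before it.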

When setting the categories argument, exchange partners will be generated from the same category. Note that when categories has multiple columns (i.e., each element is assigned to multiple categories), each combination of categories is treated as a distinct category by the exchange method.
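As a rough illustration of how several categorical columns collapse into one category per element, the following base R sketch combines two made-up columns with interaction(). The data and the helper same_category() are hypothetical and only mirror the behaviour described above; they are not taken from the package.

```r
# Hypothetical data: two categorical constraints for six elements.
categories <- data.frame(
  sex   = c("f", "f", "m", "m", "f", "m"),
  grade = c("a", "b", "a", "a", "b", "a")
)

# Each observed combination of values becomes its own category.
combined <- interaction(categories, drop = TRUE)
table(combined)

# Hypothetical helper: candidate exchange partners of element i are the
# other elements that share its combined category.
same_category <- function(i, combined) {
  setdiff(which(combined == combined[i]), i)
}

same_category(3, combined)
```

With many columns or many levels, the number of combinations grows quickly, so individual combined categories may contain few elements and therefore few candidate exchange partners.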


Martin Papenberg

See Also




library(anticlust)

## Use the four numeric columns of iris as features (drop Species):
features <- iris[, -5]

## Default call: all possible exchange partners are tested
start <- Sys.time()
ac_exchange <- fast_anticlustering(features, K = 3)
Sys.time() - start

## The following call is equivalent to the call above:
start <- Sys.time()
ac_exchange <- anticlustering(features, K = 3, objective = "variance")
Sys.time() - start

## Improve run time by using fewer exchange partners:
start <- Sys.time()
ac_fast <- fast_anticlustering(features, K = 3, k_neighbours = 10)
Sys.time() - start

## Compare the feature means per anticluster for both solutions:
by(features, ac_exchange, function(x) round(colMeans(x), 2))
by(features, ac_fast, function(x) round(colMeans(x), 2))

[Package anticlust version 0.6.0 Index]