fast_anticlustering {anticlust}    R Documentation

Fast anticlustering

Description

Anticlustering via optimizing the k-means variance criterion with an adjusted exchange method where the number of exchange partners can be specified. Note that this function is no longer the fastest way to solve anticlustering, because the exchange method used in anticlustering and kplus_anticlustering has been reimplemented in C, while fast_anticlustering still uses a plain R implementation.

Usage

fast_anticlustering(x, K, k_neighbours = Inf, categories = NULL)

Arguments

x

A numeric vector, matrix or data.frame of data points. Rows correspond to elements and columns correspond to features. A vector represents a single numeric feature.

K

How many anticlusters should be created.

k_neighbours

The number of neighbours that serve as exchange partners for each element. Defaults to Inf, i.e., each element is exchanged with each element in other groups.

categories

A vector, data.frame or matrix representing one or several categorical constraints.

Details

This function was created to make anticlustering applicable to large data sets (e.g., 100,000 elements). It optimizes the k-means variance objective because computing all pairwise distances, as is done when optimizing the diversity, is not feasible for very large data sets (roughly N > 30,000). Additionally, this function employs a speed-optimized exchange method. For each element, the potential exchange partners are generated using a nearest neighbor search with the function nn2 from the RANN package. The nearest neighbors then serve as exchange partners. This approach is inspired by the preclustering heuristic, according to which good solutions are found when similar elements are in different sets; swapping nearest neighbors will often achieve this. The number of exchange partners per element is set using the argument k_neighbours; by default, it is Inf, meaning that all possible swaps are tested. For large data sets, the user must change this default. More exchange partners generally improve the output, but also increase run time.
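The idea of restricting exchange partners to nearest neighbours can be illustrated with a minimal base R sketch. It is not the package's implementation: the text above states that fast_anticlustering uses nn2 from RANN, whereas this sketch swaps in a brute-force search via dist() so that it runs without extra packages (and is therefore only suitable for small N). The helper name nearest_partners is hypothetical.

```r
# Hypothetical helper (not part of anticlust): for each row of `x`, return the
# indices of its `k` nearest neighbours by Euclidean distance.
# Brute force via dist(); an assumption made for illustration only.
nearest_partners <- function(x, k) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf  # an element is never its own exchange partner
  # For each element, sort the other elements by distance and keep the k closest
  do.call(rbind, lapply(seq_len(nrow(d)), function(i) {
    order(d[i, ])[seq_len(k)]
  }))
}

features <- as.matrix(iris[, -5])
partners <- nearest_partners(features, k = 10)

dim(partners)   # one row per element, one column per exchange partner
partners[1, ]   # candidate exchange partners for the first element
```

With k much smaller than N, each element is compared against only k candidates instead of all other elements, which is where the reduction in run time would come from.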

When setting the categories argument, exchange partners will be generated from the same category. Note that when categories has multiple columns (i.e., each element is described by several categorical variables), each combination of categories is treated as a distinct category by the exchange method.
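How several categorical columns can collapse into one combined category may be easier to see in a small sketch. It uses base R's interaction() on made-up data; this only illustrates the idea and is not necessarily how the package computes the combinations internally.

```r
# Two illustrative categorical variables (made-up data)
categories <- data.frame(
  sex   = c("f", "f", "m", "m", "f", "m"),
  group = c("a", "b", "a", "b", "a", "a")
)

# Each distinct combination of values becomes one combined category
combined <- interaction(categories, drop = TRUE)
table(combined)

# Under the rule described above, elements would only be swapped with
# elements that share the same combined category
split(seq_len(nrow(categories)), combined)
```

Note that combining many categorical variables can produce small categories, which leaves few potential exchange partners per element.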

Note that in recent versions of anticlust, the function anticlustering is actually faster than fast_anticlustering because the exchange method there has been implemented in C instead of plain R. In most cases, it is therefore not recommended to call fast_anticlustering; use anticlustering or kplus_anticlustering instead.

Author(s)

Martin Papenberg martin.papenberg@hhu.de

See Also

anticlustering

variance_objective

Examples



features <- iris[, -5]

start <- Sys.time()
ac_exchange <- fast_anticlustering(features, K = 3)
Sys.time() - start

## The following call is equivalent to the call above:
start <- Sys.time()
ac_exchange <- anticlustering(features, K = 3, objective = "variance")
Sys.time() - start

## Improve run time by using fewer exchange partners:
start <- Sys.time()
ac_fast <- fast_anticlustering(features, K = 3, k_neighbours = 10)
Sys.time() - start

by(features, ac_exchange, function(x) round(colMeans(x), 2))
by(features, ac_fast, function(x) round(colMeans(x), 2))


[Package anticlust version 0.8.1 Index]