fast_anticlustering {anticlust} | R Documentation |
Fast anticlustering
Description
Anticlustering via optimizing the k-means variance criterion with an
adjusted exchange method where the number of exchange partners can be
specified. Note that this function is no longer the fastest way to solve
anticlustering problems, because the exchange method used in anticlustering
and kplus_anticlustering has been reimplemented in C, while
fast_anticlustering still uses a plain R implementation.
Usage
fast_anticlustering(x, K, k_neighbours = Inf, categories = NULL)
Arguments
x |
A numeric vector, matrix or data.frame of data points. Rows correspond to elements and columns correspond to features. A vector represents a single numeric feature. |
K |
How many anticlusters should be created. |
k_neighbours |
The number of neighbours that serve as exchange partner for each element. Defaults to Inf, i.e., each element is exchanged with each element in other groups. |
categories |
A vector, data.frame or matrix representing one or several categorical constraints. |
Details
This function was created to make anticlustering applicable
to large data sets (e.g., 100,000 elements). It optimizes the k-means
variance objective because computing all pairwise distances, as is done when
optimizing the diversity, is not feasible for very large data sets (roughly N > 30000).
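To make the variance objective concrete, the following base-R sketch computes the sum of squared Euclidean distances between each element and the centroid of its group. The helper name variance_sketch is hypothetical and not part of anticlust; it is only meant to show that the criterion needs one centroid per group rather than all N(N-1)/2 pairwise distances (about 450 million for N = 30000).

```r
# Sketch of the k-means variance criterion: the sum of squared Euclidean
# distances between each element and the centroid of its group.
# `variance_sketch` is a hypothetical helper, not a function from anticlust.
variance_sketch <- function(x, groups) {
  x <- as.matrix(x)
  total <- 0
  for (g in unique(groups)) {
    members <- x[groups == g, , drop = FALSE]
    centroid <- colMeans(members)
    # Subtract the centroid from every row, then sum the squared deviations
    total <- total + sum(sweep(members, 2, centroid)^2)
  }
  total
}

features <- iris[, -5]
# Grouping by species gives homogeneous groups, so the criterion is low
by_species <- variance_sketch(features, iris$Species)
# Spreading every species evenly over three groups gives a higher value,
# which is the direction anticlustering optimizes towards
balanced <- variance_sketch(features, rep(1:3, length.out = nrow(features)))
balanced > by_species
```

Anticlustering maximizes this value, so the evenly spread assignment is the better anticlustering solution of the two.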
Additionally, this function employs a speed-optimized exchange method.
For each element, the potential exchange partners are generated using a
nearest neighbor search with the function nn2 from the RANN package. The
nearest neighbors then serve as exchange partners. This approach is
inspired by the preclustering heuristic, according to which good solutions
are found when similar elements are in different sets; swapping nearest
neighbors will often bring this about. The number of exchange partners
per element is set using the argument k_neighbours; by default, it is set
to Inf, meaning that all possible swaps are tested. For large data sets,
the user has to lower this default to keep the run time manageable.
More exchange partners generally improve the output, but also increase
run time.
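The selection of exchange partners described above can be sketched in base R. This is a stand-in under stated assumptions: it uses a full distance matrix from dist() instead of RANN's nn2, so it only works for small data, and the helper name nearest_partners is hypothetical. It illustrates the idea of picking the k_neighbours closest elements, not the package's internal code.

```r
# Base-R stand-in for the nearest neighbor search. The package itself relies
# on nn2 from RANN; a full distance matrix is only feasible for small N.
# `nearest_partners` is a hypothetical helper for illustration only.
nearest_partners <- function(x, k_neighbours) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf  # an element is never its own exchange partner
  # For each element, keep the indices of its k closest other elements
  idx <- apply(d, 1, function(row) order(row)[seq_len(k_neighbours)])
  matrix(idx, ncol = k_neighbours, byrow = TRUE)
}

features <- iris[, -5]
partners <- nearest_partners(features, k_neighbours = 10)
dim(partners)   # one row per element, one column per candidate partner
partners[1, ]   # candidate exchange partners for the first element
```

With k_neighbours = 10, each element is compared against 10 candidates instead of all other elements, which is where the run time saving comes from.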
When the categories argument is set, exchange partners will
be generated from the same category. Note that when categories
has multiple columns (i.e., each element is assigned to multiple
categories), each combination of categories is treated as a distinct
category by the exchange method.
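The treatment of multi-column categories can be illustrated with base R's interaction(). The data below are made up for illustration, and this is a sketch of the idea rather than the package's internal code: every observed combination of levels becomes one category, and only elements sharing that category would be candidate exchange partners.

```r
# Two hypothetical categorical constraints for six elements
categories <- data.frame(
  sex   = c("f", "f", "m", "m", "f", "m"),
  group = c("a", "b", "a", "b", "a", "b")
)

# Each observed combination of levels becomes one distinct category
combined <- interaction(categories, drop = TRUE)
nlevels(combined)

# Elements sharing a combined category are potential exchange partners
split(seq_len(nrow(categories)), combined)
```

Note that the number of combined categories grows quickly with the number of columns, which shrinks the pool of exchange partners available to each element.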
Note that in recent versions of anticlust, the function anticlustering
is actually faster than fast_anticlustering because the exchange method
there has been implemented in C instead of plain R. In most cases it is
therefore not recommended to call fast_anticlustering; use anticlustering
or kplus_anticlustering instead.
Author(s)
Martin Papenberg martin.papenberg@hhu.de
See Also
Examples
library(anticlust)
features <- iris[, -5]
## Time the default call, which tests all possible exchange partners:
start <- Sys.time()
ac_exchange <- fast_anticlustering(features, K = 3)
Sys.time() - start
## The following call is equivalent to the call above:
start <- Sys.time()
ac_exchange <- anticlustering(features, K = 3, objective = "variance")
Sys.time() - start
## Improve run time by using fewer exchange partners:
start <- Sys.time()
ac_fast <- fast_anticlustering(features, K = 3, k_neighbours = 10)
Sys.time() - start
## Compare the feature means per anticluster for both solutions:
by(features, ac_exchange, function(x) round(colMeans(x), 2))
by(features, ac_fast, function(x) round(colMeans(x), 2))