tkmeans {tclust}    R Documentation

TKMEANS method for robust K-means clustering

Description

This function searches for k (or fewer) spherical clusters in a data matrix x, while the ceiling(alpha n) most outlying observations are trimmed.
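As a quick worked example of the trimming size, the following base-R lines compute ceiling(alpha n). The sample size n = 150 is an illustrative assumption; alpha = 0.05 matches the default shown in Usage.

```r
# Number of observations trimmed for a given sample size and trimming level.
# n is an illustrative value; alpha = 0.05 is the default from the Usage section.
n <- 150
alpha <- 0.05
n_trim <- ceiling(alpha * n)   # alpha * n = 7.5 is rounded up
n_trim                         # 8
n - n_trim                     # 142 observations remain for clustering
```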

Usage

tkmeans(
  x,
  k,
  alpha = 0.05,
  nstart = 500,
  niter1 = 3,
  niter2 = 20,
  nkeep = 5,
  iter.max,
  points = NULL,
  center = FALSE,
  scale = FALSE,
  store_x = TRUE,
  parallel = FALSE,
  n.cores = -1,
  zero_tol = 1e-16,
  drop.empty.clust = TRUE,
  trace = 0
)

Arguments

x

A matrix or data.frame of dimension n x p, containing the observations (row-wise).

k

The number of clusters initially searched for.

alpha

The proportion of observations to be trimmed.

nstart

The number of random initializations to be performed.

niter1

The number of concentration steps to be performed for the nstart initializations.

niter2

The maximum number of concentration steps to be performed for the nkeep solutions kept for further iteration. The concentration steps are stopped as soon as two consecutive steps lead to the same data partition.
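This page does not spell out what a concentration step does. The base-R sketch below shows one common reading for trimmed k-means, assuming that each step assigns observations to the nearest center, trims the ceiling(alpha n) farthest ones, and recomputes the centers. The helper c_step, the simulated data, and the use of label 0 for trimmed observations are all illustrative assumptions; the package's actual implementation may differ.

```r
# One concentration step of trimmed k-means (illustrative sketch, base R only).
# `c_step` is a hypothetical helper, not part of tclust.
c_step <- function(x, centers, alpha) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- nrow(centers)
  # Squared Euclidean distance from every observation to every center (n x k)
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  d2 <- matrix(d2, nrow = n)            # keep matrix shape even when k == 1
  nearest <- max.col(-d2, ties.method = "first")
  dmin <- d2[cbind(seq_len(n), nearest)]
  # Trim the ceiling(alpha * n) observations farthest from their nearest center
  n_trim <- ceiling(alpha * n)
  trimmed <- order(dmin, decreasing = TRUE)[seq_len(n_trim)]
  cluster <- nearest
  cluster[trimmed] <- 0L                # assumption: 0 marks a trimmed observation
  # Update each center as the mean of its retained observations
  for (j in seq_len(k)) {
    idx <- cluster == j
    if (any(idx)) centers[j, ] <- colMeans(x[idx, , drop = FALSE])
  }
  list(centers = centers, cluster = cluster)
}

# Iterate at most 20 steps, stopping when two consecutive partitions coincide
set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
fit <- list(centers = x[c(1, 51), , drop = FALSE], cluster = NULL)
for (i in 1:20) {
  new_fit <- c_step(x, fit$centers, alpha = 0.1)
  if (identical(new_fit$cluster, fit$cluster)) break
  fit <- new_fit
}
table(fit$cluster)
```

The loop bound of 20 mirrors the default niter2 = 20, and the identical() check mirrors the stopping rule described above.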

nkeep

The number of iterated initializations (after niter1 concentration steps) with the best values of the target function that are kept for further iterations.
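A hedged usage sketch that sets the four search parameters nstart, niter1, nkeep and niter2 explicitly. The simulated data and the chosen values are illustrative assumptions, and the call is wrapped in requireNamespace() so that the snippet does nothing when tclust is not installed.

```r
# Two-stage search: 50 random starts get 3 concentration steps each,
# then the 5 best are iterated for up to 20 further steps.
if (requireNamespace("tclust", quietly = TRUE)) {
  set.seed(123)
  x <- rbind(matrix(rnorm(200), ncol = 2),
             matrix(rnorm(200, mean = 5), ncol = 2))
  clus <- tclust::tkmeans(x, k = 2, alpha = 0.05,
                          nstart = 50, niter1 = 3, nkeep = 5, niter2 = 20)
  print(clus)
}
```

Lowering nstart from its default of 500 shortens the run time, at the cost of exploring fewer random initializations.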

iter.max

(deprecated, use the combination of nkeep, niter1 and niter2 instead) The maximum number of concentration steps to be performed. The concentration steps are stopped as soon as two consecutive steps lead to the same data partition.

points

Optional initial mean vectors, NULL or a matrix with k vectors used as means to initialize the algorithm. If initial mean vectors are specified, nstart should be 1 (otherwise the same initial means are used for all runs).

center

Optional centering of the data: a function or a vector of length p used to center x before the computation.

scale

Optional scaling of the data: a function or a vector of length p used to scale x before the computation.
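To illustrate what centering and scaling with a function could amount to, the base-R sketch below standardizes each column with its median and MAD. It assumes that a supplied function is applied column-wise to x. The simulated data and variable names are illustrative, and the package's internal handling may differ.

```r
# Robust column-wise standardization with median and MAD (base R only).
# Assumption: a function passed as center/scale is applied to each column of x.
set.seed(42)
x <- cbind(rnorm(50, mean = 10, sd = 2),
           rnorm(50, mean = -3, sd = 0.5))
ctr <- apply(x, 2, median)   # one centering value per column (length p)
scl <- apply(x, 2, mad)      # one scaling value per column (length p)
x_std <- sweep(sweep(x, 2, ctr, "-"), 2, scl, "/")
round(apply(x_std, 2, median), 10)   # columns are now centered at 0
round(apply(x_std, 2, mad), 10)      # and have unit MAD
```

The vectors ctr and scl have length p, so they are also of the form that the vector variant of center and scale expects.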

store_x

A logical value specifying whether the data matrix x shall be included in the result object. By default this value is set to TRUE, because some of the plotting functions depend on this information. However, for large data matrices, setting this parameter to FALSE can noticeably reduce the size of the result object.

parallel

A logical value, specifying whether the nstart initializations should be done in parallel.

n.cores

The number of cores to use when parallelizing; only taken into account if parallel = TRUE.

zero_tol

The zero tolerance used. By default set to 1e-16.

drop.empty.clust

Logical value specifying whether empty clusters shall be omitted from the resulting object. If TRUE, the result structure no longer contains center estimates of empty clusters, and cluster names are reassigned such that the first l clusters (l <= k) always have at least one observation.
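The relabelling described for drop.empty.clust can be sketched in base R as follows. The label vector is made up for illustration, and two conventions are assumptions rather than facts from this page: that trimmed observations carry label 0, and that the remaining clusters keep their relative order.

```r
# Relabel clusters so that the l non-empty ones are numbered 1..l (base R only).
# Assumptions: 0 marks trimmed observations; relative cluster order is preserved.
cluster <- c(3, 3, 0, 1, 3, 1, 0)     # k = 3 was requested, cluster 2 is empty
non_empty <- sort(unique(cluster[cluster > 0]))
relabelled <- match(cluster, non_empty, nomatch = 0L)
relabelled                            # 2 2 0 1 2 1 0
length(non_empty)                     # l = 2 clusters remain
```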

trace

Defines the tracing level, which is set to 0 by default. Tracing level 1 gives additional information on the stage of the iterative process.

Value

The function returns the following values:

Author(s)

Valentin Todorov, Luis Angel Garcia Escudero, Agustin Mayo Iscar.

References

Cuesta-Albertos, J. A.; Gordaliza, A. and Matrán, C. (1997), "Trimmed k-means: an attempt to robustify quantizers". Annals of Statistics, Vol. 25 (2), 553-576.

Examples


 
 ##--- EXAMPLE 1 ------------------------------------------
 sig <- diag(2)
 cen <- rep(1,2)
 x <- rbind(MASS::mvrnorm(360, cen * 0,   sig),
            MASS::mvrnorm(540, cen * 5,   sig),
            MASS::mvrnorm(100, cen * 2.5, sig))
 
 ## Two groups and 10% trimming level
 (clus <- tkmeans(x, k = 2, alpha = 0.1))

 plot(clus)
 plot(clus, labels = "observation")
 plot(clus, labels = "cluster")

 ##--- EXAMPLE 2 ------------------------------------------
 data(geyser2)
 (clus <- tkmeans(geyser2, k = 3, alpha = 0.03))
 plot(clus)
 

[Package tclust version 2.0-4 Index]