R: K-Means Clustering

kmeans {clustlearn}

R Documentation

K-Means Clustering

Description

Perform K-Means clustering on a data matrix.

Usage

kmeans(
  data,
  centers,
  max_iterations = 10,
  initialization = "kmeans++",
  details = FALSE,
  waiting = TRUE,
  ...
)

Arguments

`data`	a set of observations, presented as a matrix-like object where every row is a new observation.
`centers`	either the number of clusters or a set of initial cluster centers. If a number, the centers are chosen according to the `initialization` parameter.
`max_iterations`	the maximum number of iterations allowed.
`initialization`	the initialization method to be used. This should be one of `"random"` or `"kmeans++"`. The latter is the default.
`details`	a Boolean determining whether intermediate logs explaining how the algorithm works should be printed or not.
`waiting`	a Boolean determining whether the intermediate logs are printed in chunks, pausing for user input before each new chunk is printed.
`...`	additional arguments passed to `proxy::dist()`.

Details

The data given by `data` is clustered by the k-means method, which aims to partition the points into k groups such that the sum of squared distances from the points to their assigned cluster centers is minimized. At the minimum, every cluster center is the mean of its Voronoi set (the set of data points that are nearest to that center).
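The quantity being minimized can be made concrete with a few lines of base R. This is only an illustrative sketch: `wss` is a hypothetical helper name, not a function exported by clustlearn, and it assumes squared Euclidean distance.

```r
# Total within-cluster sum of squares: the quantity k-means tries to minimize.
# `wss` is a hypothetical helper, not part of clustlearn.
wss <- function(data, centers, cluster) {
  data <- as.matrix(data)
  # Subtract from every observation the center of the cluster it belongs to
  residuals <- data - centers[cluster, , drop = FALSE]
  sum(residuals^2)
}

# Four points on a line, split into two obvious groups
x <- matrix(c(0, 1, 10, 11), ncol = 1)
cluster <- c(1, 1, 2, 2)

# Centers placed at the cluster means
at_means <- matrix(c(0.5, 10.5), ncol = 1)
# Centers shifted away from the means
shifted <- matrix(c(0, 10), ncol = 1)

wss(x, at_means, cluster)  # 1
wss(x, shifted, cluster)   # 2
```

For a fixed assignment, placing each center at the mean of its cluster gives the smaller value, which is why the centers sit at the means of their Voronoi sets at the minimum.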

The k-means method proceeds in a sequence of steps, at least 2 and up to n:

The first step consists of 3 sub-steps:
1. Selection of the number k of clusters into which the data is going to be grouped, and of the initial centers that represent them. This is determined by the `centers` parameter.
2. Computation of the distance from each data point to each center.
3. Assignment of each observation to a cluster. The observation is assigned to the cluster represented by the nearest center.
Every subsequent step repeats sub-steps 2 and 3, but replaces sub-step 1 with:
1. Computation of the new centers. The center of each cluster is computed as the mean of the observations assigned to that cluster.

The algorithm stops once the centers in step n+1 are the same as the ones in step n. However, this convergence is not guaranteed to happen within a reasonable number of steps. For this reason, the algorithm also stops once the maximum number of iterations, `max_iterations`, is reached.
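The steps and the stopping rule above can be sketched in base R as follows. This is not clustlearn's implementation: `lloyd_sketch` is a hypothetical name, the distance is assumed to be squared Euclidean, and the handling of empty clusters is a choice made for this sketch only.

```r
# Minimal sketch of the iteration described above (base R only).
# `lloyd_sketch` is a hypothetical name; this is not clustlearn's code.
lloyd_sketch <- function(data, centers, max_iterations = 10) {
  data <- as.matrix(data)
  centers <- as.matrix(centers)
  k <- nrow(centers)
  cluster <- integer(nrow(data))
  for (iter in seq_len(max_iterations)) {
    # Squared Euclidean distance from every observation (rows)
    # to every center (columns)
    d <- matrix(0, nrow(data), k)
    for (j in seq_len(k)) {
      d[, j] <- rowSums(sweep(data, 2, centers[j, ])^2)
    }
    # Assign each observation to its nearest center
    cluster <- max.col(-d, ties.method = "first")
    # Recompute each center as the mean of its observations;
    # a center with no observations is left where it was (sketch assumption)
    new_centers <- centers
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) {
        new_centers[j, ] <- colMeans(data[members, , drop = FALSE])
      }
    }
    # Stop when the centers no longer move
    if (isTRUE(all.equal(new_centers, centers))) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Two well-separated pairs of points, with deliberately poor initial centers
x <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
res <- lloyd_sketch(x, centers = x[1:2, ])
res$cluster  # 1 1 2 2
res$centers  # rows (0, 0.5) and (10, 10.5)
```

The loop ends either through the `break`, when two consecutive sets of centers coincide, or when `max_iterations` is exhausted, mirroring the two stopping conditions in the text.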

The initialization methods provided by this function are:

`"random"`: `centers` observations are chosen at random from the data as the initial centers.
`"kmeans++"`: `centers` observations are chosen using the kmeans++ algorithm. This algorithm chooses the first center at random and then chooses each subsequent center from the remaining observations with probability proportional to the squared distance to the closest center already chosen. This process is repeated until `centers` centers have been chosen.
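The kmeans++ seeding described above can be sketched in base R as follows. `kmeanspp_sketch` is a hypothetical name and squared Euclidean distance is assumed; the package's own code may differ in its details.

```r
# Sketch of k-means++ seeding as described above (base R only).
# `kmeanspp_sketch` is a hypothetical name; this is not clustlearn's code.
kmeanspp_sketch <- function(data, k) {
  data <- as.matrix(data)
  n <- nrow(data)
  chosen <- integer(k)
  # First center: drawn uniformly at random
  chosen[1] <- sample.int(n, 1)
  # Squared distance from every observation to its closest chosen center
  d2 <- rowSums(sweep(data, 2, data[chosen[1], ])^2)
  for (i in seq_len(k)[-1]) {
    # Draw the next center with probability proportional to d2;
    # observations already chosen have d2 == 0 and cannot be drawn again
    chosen[i] <- sample.int(n, 1, prob = d2)
    # Update the distance to the closest center chosen so far
    d2 <- pmin(d2, rowSums(sweep(data, 2, data[chosen[i], ])^2))
  }
  data[chosen, , drop = FALSE]
}

set.seed(1)
x <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
init <- kmeanspp_sketch(x, 2)
init  # two distinct rows of x
```

Because far-away observations carry most of the probability mass, the second center in this toy data is much more likely to come from the other pair of points than from the neighbor of the first center.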

Value

An object of the kind returned by `stats::kmeans()`.
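To show what such an object looks like, the sketch below fits `stats::kmeans()` itself on simulated data and inspects a few components. It assumes that the result of `clustlearn::kmeans()` can be inspected in the same way; the examples in this page rely only on `$cluster` and `$centers`.

```r
# Simulated data: two groups of 20 two-dimensional observations each
set.seed(42)
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

# Reference object produced by base R's own implementation
fit <- stats::kmeans(x, centers = 2)

class(fit)        # "kmeans"
fit$cluster       # cluster index of every observation
fit$centers       # one row per cluster center
fit$size          # number of observations in each cluster
fit$tot.withinss  # total within-cluster sum of squares
```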

Author(s)

Eduardo Ruiz Sabajanes, eduardo.ruizs@edu.uah.es

Examples

### Voronoi tessellation (drawn only if the deldir package is available)
voronoi <- suppressMessages(suppressWarnings(require(deldir)))
cols <- c(
  "#00000019",
  "#DF536B19",
  "#61D04F19",
  "#2297E619",
  "#28E2E519",
  "#CD0BBC19",
  "#F5C71019",
  "#9E9E9E19"
)

### Helper function
test <- function(db, k) {
  print(cl <- clustlearn::kmeans(db, k, 100))
  plot(db, col = cl$cluster, asp = 1, pch = 20)
  points(cl$centers, col = seq_len(k), pch = 13, cex = 2, lwd = 2)

  if (voronoi) {
    x <- c(min(db[, 1]), max(db[, 1]))
    dx <- c(x[1] - x[2], x[2] - x[1])
    y <- c(min(db[, 2]), max(db[, 2]))
    dy <- c(y[1] - y[2], y[2] - y[1])
    tesselation <- deldir(
      cl$centers[, 1],
      cl$centers[, 2],
      rw = c(x + dx, y + dy)
    )
    tiles <- tile.list(tesselation)

    plot(
      tiles,
      asp = 1,
      add = TRUE,
      showpoints = FALSE,
      border = "#00000000",
      fillcol = cols
    )
  }
}

### Example 1
test(clustlearn::db1, 2)

### Example 2
test(clustlearn::db2, 2)

### Example 3
test(clustlearn::db3, 3)

### Example 4
test(clustlearn::db4, 3)

### Example 5
test(clustlearn::db5, 3)

### Example 6
test(clustlearn::db6, 3)

### Example 7 (with explanations, no plots)
cl <- clustlearn::kmeans(
  clustlearn::db5[1:20, ],
  3,
  details = TRUE,
  waiting = FALSE
)

[Package clustlearn version 1.0.0 Index]