gaussian_mixture {clustlearn}    R Documentation

Gaussian mixture model

Description

Perform Gaussian mixture model clustering on a data matrix.

Usage

gaussian_mixture(data, k, max_iter = 10, details = FALSE, waiting = TRUE, ...)

Arguments

data

a set of observations, presented as a matrix-like object where every row is a new observation.

k

the number of clusters to find.

max_iter

the maximum number of iterations to perform.

details

a Boolean determining whether intermediate logs explaining how the algorithm works should be printed.

waiting

a Boolean determining whether the intermediate logs should be printed in chunks, waiting for user input before printing the next chunk.

...

additional arguments passed to kmeans().
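Since the initial cluster centers come from a k-means run, arguments of stats::kmeans() can be forwarded through the ... argument. For example (a hypothetical call; nstart is a stats::kmeans() argument, and db1 is one of the package's example datasets):

# Forward nstart to the kmeans() initialization: try 5 random starts
# and keep the best one as the GMM's starting point.
cl <- clustlearn::gaussian_mixture(clustlearn::db1, k = 2, nstart = 5)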

Details

The data given by data is clustered by a model-based algorithm that assumes every cluster follows a multivariate normal distribution, hence the name "Gaussian Mixture".

The normal distributions are parameterized by their mean vector, covariance matrix and mixing proportion. Initially, the mean vectors are set to the cluster centers obtained by performing a k-means clustering on the data, each covariance matrix is set to the covariance matrix of the data points belonging to the corresponding cluster, and each mixing proportion is set to the proportion of data points belonging to that cluster. The algorithm then optimizes the Gaussian models by means of the Expectation-Maximization (EM) algorithm.
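In code, this initialization might look roughly as follows (a minimal sketch, not the package's actual internals; init_gmm is a hypothetical name, and the sigma array layout follows the Value section below):

# Sketch of the initialization described above. X is an n x d numeric
# matrix, k the number of clusters.
init_gmm <- function(X, k) {
  km <- stats::kmeans(X, k)
  d <- ncol(X)
  mu <- km$centers                       # k x d matrix of initial means
  sigma <- array(0, dim = c(k, d, d))    # one d x d covariance per component
  lambda <- numeric(k)                   # mixing proportions
  for (i in seq_len(k)) {
    pts <- X[km$cluster == i, , drop = FALSE]
    sigma[i, , ] <- stats::cov(pts)      # covariance of the points in cluster i
    lambda[i] <- nrow(pts) / nrow(X)     # share of points in cluster i
  }
  list(mu = mu, sigma = sigma, lambda = lambda)
}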

The EM algorithm is an iterative procedure that alternates between two steps:

Expectation

Compute, for each observation, the probability (or responsibility) with which it is expected to belong to each component of the GMM.

Maximization

Re-estimate the parameters of the GMM from the expectations computed in the E-step, so as to maximize the expected log likelihood.

The algorithm stops when the changes in the expectations between iterations are sufficiently small or when the maximum number of iterations (max_iter) is reached.
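One full iteration, as described above, might be sketched like this (an illustration under the same parameter layout as the returned object, not the package's actual code; em_step is a hypothetical name):

# One EM iteration. X: n x d data matrix; mu: k x d means;
# sigma: k x d x d covariances; lambda: length-k mixing proportions.
em_step <- function(X, mu, sigma, lambda) {
  k <- length(lambda)
  # Multivariate normal density of every row of X under N(m, S)
  dens <- function(m, S) {
    exp(-0.5 * stats::mahalanobis(X, m, S)) / sqrt((2 * pi)^ncol(X) * det(S))
  }
  # E-step: responsibilities r[j, i] proportional to lambda_i * N(x_j; mu_i, sigma_i)
  r <- sapply(seq_len(k), function(i) lambda[i] * dens(mu[i, ], sigma[i, , ]))
  loglik <- sum(log(rowSums(r)))   # log likelihood before normalizing
  r <- r / rowSums(r)
  # M-step: weighted re-estimation of each component's parameters
  for (i in seq_len(k)) {
    w <- r[, i]
    mu[i, ] <- colSums(w * X) / sum(w)
    d <- sweep(X, 2, mu[i, ])
    sigma[i, , ] <- crossprod(d * w, d) / sum(w)
    lambda[i] <- mean(w)
  }
  list(mu = mu, sigma = sigma, lambda = lambda, loglik = loglik)
}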

Value

A gaussian_mixture() object. It is a list with the following components:

cluster      a vector of integers (from 1:k) indicating the cluster to which each point belongs.
mu           the final mean parameters.
sigma        the final covariance matrices.
lambda       the final mixing proportions.
loglik       the final log likelihood.
all.loglik   a vector of each iteration's log likelihood.
iter         the number of iterations performed.
size         a vector with the number of data points belonging to each cluster.
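For instance, the components of a fitted model can be inspected directly (a small sketch using one of the package's example datasets):

cl <- clustlearn::gaussian_mixture(clustlearn::db1, 2)
table(cl$cluster)                # cluster sizes; should match cl$size
cl$lambda                        # mixing proportions, summing to 1
plot(cl$all.loglik, type = "b")  # EM makes the log likelihood non-decreasing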

Author(s)

Eduardo Ruiz Sabajanes, eduardo.ruizs@edu.uah.es

Examples

### !! This algorithm is very slow, so we'll only test it on some datasets !!

### Helper functions
# Multivariate normal density: returns the density of each row of x
# under N(mu, sigma).
dmnorm <- function(x, mu, sigma) {
  k <- ncol(sigma)  # dimensionality

  x <- as.matrix(x)
  diff <- t(t(x) - mu)  # center every observation at mu

  # Quadratic form (x - mu)' sigma^-1 (x - mu), evaluated for all rows at once
  num <- exp(-1 / 2 * diag(diff %*% solve(sigma) %*% t(diff)))
  den <- sqrt(((2 * pi)^k) * det(sigma))
  num / den
}

# Fit a GMM with up to 100 EM iterations, plot the clustered data, and
# overlay the density contours of each fitted component.
test <- function(db, k) {
  print(cl <- clustlearn::gaussian_mixture(db, k, 100))

  # Evaluation grid spanning the range of the data
  x <- seq(min(db[, 1]), max(db[, 1]), length.out = 100)
  y <- seq(min(db[, 2]), max(db[, 2]), length.out = 100)

  plot(db, col = cl$cluster, asp = 1, pch = 20)
  for (i in seq_len(k)) {
    m <- cl$mu[i, ]       # mean of component i
    s <- cl$sigma[i, , ]  # covariance of component i
    f <- function(x, y) cl$lambda[i] * dmnorm(cbind(x, y), m, s)
    z <- outer(x, y, f)
    contour(x, y, z, col = i, add = TRUE)
  }
}

### Example 1
test(clustlearn::db1, 2)

### Example 2
# test(clustlearn::db2, 2)

### Example 3
test(clustlearn::db3, 3)

### Example 4
test(clustlearn::db4, 3)

### Example 5
test(clustlearn::db5, 3)

### Example 6
# test(clustlearn::db6, 3)

### Example 7 (with explanations, no plots)
cl <- clustlearn::gaussian_mixture(
  clustlearn::db5[1:20, ],
  3,
  details = TRUE,
  waiting = FALSE
)

