gaussian_mixture {clustlearn}    R Documentation

Gaussian mixture model

Description

Perform Gaussian mixture model clustering on a data matrix.

Usage

gaussian_mixture(data, k, max_iter = 10, details = FALSE, waiting = TRUE, ...)

Arguments

data

a set of observations, presented as a matrix-like object where every row is a new observation.

k

the number of clusters to find.

max_iter

the maximum number of iterations to perform.

details

a Boolean determining whether intermediate logs explaining how the algorithm works should be printed.

waiting

a Boolean determining whether the intermediate logs should be printed in chunks, waiting for user input before printing the next chunk.

...

additional arguments passed to kmeans().
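Since the initial cluster centers come from a k-means run, arguments of stats::kmeans() can be forwarded through the ... argument. For example (a hypothetical call; nstart is a stats::kmeans() argument, and db1 is one of the package's example datasets):

# Forward nstart to the kmeans() initialization: try 5 random starts
# and keep the best one as the GMM's starting point.
cl <- clustlearn::gaussian_mixture(clustlearn::db1, k = 2, nstart = 5)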

Details

The data given by data is clustered by a model-based algorithm that assumes every cluster follows a multivariate normal distribution, hence the name "Gaussian Mixture".

The normal distributions are parameterized by their mean vector, covariance matrix and mixing proportion. Initially, the mean vectors are set to the cluster centers obtained by performing a k-means clustering on the data, each covariance matrix is set to the covariance matrix of the data points belonging to the corresponding cluster, and each mixing proportion is set to the proportion of data points belonging to that cluster. The algorithm then optimizes the Gaussian models by means of the Expectation-Maximization (EM) algorithm.
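In code, this initialization might look roughly as follows (a minimal sketch, not the package's actual internals; init_gmm is a hypothetical name, and the sigma array layout follows the Value section below):

# Sketch of the initialization described above. X is an n x d numeric
# matrix, k the number of clusters.
init_gmm <- function(X, k) {
  km <- stats::kmeans(X, k)
  d <- ncol(X)
  mu <- km$centers                       # k x d matrix of initial means
  sigma <- array(0, dim = c(k, d, d))    # one d x d covariance per component
  lambda <- numeric(k)                   # mixing proportions
  for (i in seq_len(k)) {
    pts <- X[km$cluster == i, , drop = FALSE]
    sigma[i, , ] <- stats::cov(pts)      # covariance of the points in cluster i
    lambda[i] <- nrow(pts) / nrow(X)     # share of points in cluster i
  }
  list(mu = mu, sigma = sigma, lambda = lambda)
}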

The EM algorithm is an iterative procedure that alternates between two steps:

Expectation

Compute, for each observation, the probability (or responsibility) with which it is expected to belong to each component of the GMM.

Maximization

Re-estimate the parameters of the GMM from the expectations computed in the E-step, so as to maximize the expected log likelihood.

The algorithm stops when the changes in the expectations between iterations are sufficiently small or when the maximum number of iterations (max_iter) is reached.
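One full iteration, as described above, might be sketched like this (an illustration under the same parameter layout as the returned object, not the package's actual code; em_step is a hypothetical name):

# One EM iteration. X: n x d data matrix; mu: k x d means;
# sigma: k x d x d covariances; lambda: length-k mixing proportions.
em_step <- function(X, mu, sigma, lambda) {
  k <- length(lambda)
  # Multivariate normal density of every row of X under N(m, S)
  dens <- function(m, S) {
    exp(-0.5 * stats::mahalanobis(X, m, S)) / sqrt((2 * pi)^ncol(X) * det(S))
  }
  # E-step: responsibilities r[j, i] proportional to lambda_i * N(x_j; mu_i, sigma_i)
  r <- sapply(seq_len(k), function(i) lambda[i] * dens(mu[i, ], sigma[i, , ]))
  loglik <- sum(log(rowSums(r)))   # log likelihood before normalizing
  r <- r / rowSums(r)
  # M-step: weighted re-estimation of each component's parameters
  for (i in seq_len(k)) {
    w <- r[, i]
    mu[i, ] <- colSums(w * X) / sum(w)
    d <- sweep(X, 2, mu[i, ])
    sigma[i, , ] <- crossprod(d * w, d) / sum(w)
    lambda[i] <- mean(w)
  }
  list(mu = mu, sigma = sigma, lambda = lambda, loglik = loglik)
}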

Value

A gaussian_mixture() object. It is a list with the following components:

cluster      a vector of integers (from 1:k) indicating the cluster to which each point belongs.
mu           the final mean parameters.
sigma        the final covariance matrices.
lambda       the final mixing proportions.
loglik       the final log likelihood.
all.loglik   a vector of each iteration's log likelihood.
iter         the number of iterations performed.
size         a vector with the number of data points belonging to each cluster.
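For instance, the components of a fitted model can be inspected directly (a small sketch using one of the package's example datasets):

cl <- clustlearn::gaussian_mixture(clustlearn::db1, 2)
table(cl$cluster)                # cluster sizes; should match cl$size
cl$lambda                        # mixing proportions, summing to 1
plot(cl$all.loglik, type = "b")  # EM makes the log likelihood non-decreasing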

Author(s)

Eduardo Ruiz Sabajanes, eduardo.ruizs@edu.uah.es

Examples

### !! This algorithm is very slow, so we'll only test it on some datasets !!

### Helper functions
# Multivariate normal density: returns the density of each row of x
# under N(mu, sigma).
dmnorm <- function(x, mu, sigma) {
  k <- ncol(sigma)  # dimensionality

  x <- as.matrix(x)
  diff <- t(t(x) - mu)  # center every observation at mu

  # Quadratic form (x - mu)' sigma^-1 (x - mu), evaluated for all rows at once
  num <- exp(-1 / 2 * diag(diff %*% solve(sigma) %*% t(diff)))
  den <- sqrt(((2 * pi)^k) * det(sigma))
  num / den
}

# Fit a GMM with up to 100 EM iterations, plot the clustered data, and
# overlay the density contours of each fitted component.
test <- function(db, k) {
  print(cl <- clustlearn::gaussian_mixture(db, k, 100))

  # Evaluation grid spanning the range of the data
  x <- seq(min(db[, 1]), max(db[, 1]), length.out = 100)
  y <- seq(min(db[, 2]), max(db[, 2]), length.out = 100)

  plot(db, col = cl$cluster, asp = 1, pch = 20)
  for (i in seq_len(k)) {
    m <- cl$mu[i, ]       # mean of component i
    s <- cl$sigma[i, , ]  # covariance of component i
    f <- function(x, y) cl$lambda[i] * dmnorm(cbind(x, y), m, s)
    z <- outer(x, y, f)
    contour(x, y, z, col = i, add = TRUE)
  }
}

### Example 1
test(clustlearn::db1, 2)

### Example 2
# test(clustlearn::db2, 2)

### Example 3
test(clustlearn::db3, 3)

### Example 4
test(clustlearn::db4, 3)

### Example 5
test(clustlearn::db5, 3)

### Example 6
# test(clustlearn::db6, 3)

### Example 7 (with explanations, no plots)
cl <- clustlearn::gaussian_mixture(
  clustlearn::db5[1:20, ],
  3,
  details = TRUE,
  waiting = FALSE
)

