GM {GMKMcharlie}    R Documentation
Multithreaded Gaussian mixture trainer
Description
The traditional maximum-likelihood training algorithm (expectation-maximization) for parameterizing weighted data with a mixture of Gaussian PDFs. Bounds on covariance matrix eigen ratios and on mixture weights are optional.
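In standard GMM notation (a sketch in our own notation, not quoted from the package), the trainer seeks parameters that maximize the weighted log-likelihood

  \log L(\alpha, \mu, \Sigma) = \sum_{i=1}^{N} w_i \log \sum_{g=1}^{G} \alpha_g \, \mathcal{N}(x_i \mid \mu_g, \Sigma_g), \qquad \sum_{g=1}^{G} \alpha_g = 1,

where the x_i are the columns of X and the w_i are the observation weights Xw.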
Usage
GM(
  X,
  Xw = rep(1, ncol(X)),
  alpha = numeric(0),
  mu = matrix(ncol = 0, nrow = 0),
  sigma = matrix(ncol = 0, nrow = 0),
  G = 5L,
  convergenceEPS = 1e-05,
  convergenceTail = 10,
  alphaEPS = 0,
  eigenRatioLim = Inf,
  embedNoise = 1e-6,
  maxIter = 1000L,
  maxCore = 7L,
  tlimit = 3600,
  verbose = TRUE,
  updateAlpha = TRUE,
  updateMean = TRUE,
  updateSigma = TRUE,
  checkInitialization = FALSE,
  KmeansFirst = TRUE,
  KmeansPPfirst = FALSE,
  KmeansRandomSeed = NULL,
  friendlyOutput = TRUE
)
Arguments
X: A d x N numeric matrix where each column is a d-dimensional observation.

Xw: A numeric vector of size N, the weights on observations. Default rep(1, ncol(X)), i.e. unweighted.

alpha: A numeric vector of size G, the initial mixture weights. Default empty (numeric(0)), in which case the program initializes the weights itself.

mu: A d x G numeric matrix whose columns are the initial means of the Gaussian kernels. Default empty.

sigma: Either a list of G initial d x d covariance matrices, or a d^2 x G numeric matrix where each column is a flattened d x d covariance matrix. Default empty.
G: An integer, the number of Gaussian kernels in the mixture. If at least one of the parameters alpha, mu, sigma is empty, the program initializes G kernels itself; otherwise G is deduced from the supplied parameters. Default 5.

convergenceEPS: A numeric value. If the average change of all parameters in the mixture model between iterations falls below convergenceEPS, the model is considered converged. Default 1e-05. See convergenceTail.

convergenceTail: An integer. Training stops once every one of the last convergenceTail iterations has yielded a parameter change below convergenceEPS. Default 10.

alphaEPS: A numeric value. During training, if any Gaussian kernel's weight is no greater than alphaEPS, the kernel is deleted. Default 0.

eigenRatioLim: A numeric value. During training, if any Gaussian kernel's max:min eigen value ratio exceeds eigenRatioLim, the kernel is treated as degenerate and removed. Default Inf, i.e. no constraint.

embedNoise: A small constant added to the diagonal entries of all covariance matrices. This may prevent covariance matrices collapsing prematurely. A suggested value is 1e-6. Covariance degeneration is detected during Cholesky decomposition and leads the trainer to remove the corresponding mixture component. For high-dimensional problems, a larger embedNoise may be needed to keep the covariance matrices well-conditioned. Default 1e-6.
maxIter: An integer, the maximal number of iterations. Default 1000.

maxCore: An integer, the maximal number of threads to invoke. Should be no more than the total number of logical processors on the machine. Default 7.
tlimit: A numeric value. The program returns the current model once the training time exceeds tlimit seconds. Default 3600.

verbose: A boolean value. TRUE prints training progress. Default TRUE.

updateAlpha: A boolean value or a boolean vector of size G. As a single value, TRUE lets the mixture weights be updated during training and FALSE keeps them fixed. As a vector, its g-th element governs the weight of the g-th kernel. Default TRUE.

updateMean: A boolean value or a boolean vector of size G, analogous to updateAlpha but governing the kernel means. Default TRUE.

updateSigma: A boolean value or a boolean vector of size G, analogous to updateAlpha but governing the covariance matrices. Default TRUE.
checkInitialization: A boolean value. Check whether any of the input covariance matrices are singular before training. Default FALSE.

KmeansFirst: A boolean value. Run K-means clustering to find the initial means. Default TRUE.

KmeansPPfirst: A boolean value. Run stochastic K-means++ for the K-means initialization. Default FALSE.

KmeansRandomSeed: An integer or NULL, the random seed for the stochastic K-means++ initialization. Default NULL.
friendlyOutput: A boolean value. TRUE returns sigma as a list of G d x d covariance matrices; FALSE returns sigma as a d^2 x G matrix where each column is a flattened covariance matrix, as the examples below illustrate. Default TRUE.
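To make the two accepted sigma formats concrete, a minimal sketch (with hypothetical d = 2 and G = 3):

d = 2L; G = 3L
S = diag(d)                                              # one d x d covariance matrix
sigmaList = rep(list(S), G)                              # format 1: a list of G covariance matrices
sigmaFlat = matrix(rep(as.numeric(S), G), nrow = d * d)  # format 2: d^2 x G, one flattened matrix per column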
Details
An in-place Cholesky decomposition of the covariance matrix is implemented for space and speed efficiency. Only the upper triangle of the Cholesky factor is stored for each Gaussian kernel, and it is fed to a forward substitution routine for fast computation of Mahalanobis distances. Each of the three main steps in an iteration (Gaussian density computation, parameter inference, parameter update) is multithreaded and hand-scheduled using Intel TBB. Considerable effort has gone into cache-friendliness and into minimizing extra multithreading overheads such as memory allocation.
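As an illustration of that forward substitution step, here is a minimal sketch in plain R (not the package's C++ internals) of a squared Mahalanobis distance computed from the upper-triangular Cholesky factor:

set.seed(0)
d = 4L
S = crossprod(matrix(rnorm(d * d), d))   # a random symmetric positive-definite covariance
x = rnorm(d); m = rnorm(d)               # an observation and a kernel mean
U = chol(S)                              # S = t(U) %*% U with U upper triangular
zvec = forwardsolve(t(U), x - m)         # forward substitution: solve t(U) %*% zvec = x - m
sum(zvec ^ 2)                            # squared Mahalanobis distance
drop(t(x - m) %*% solve(S) %*% (x - m))  # agrees with the direct formula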
If eigenRatioLim is finite, eigen decomposition employs routines in RcppArmadillo.
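A minimal sketch of the max:min eigen ratio that eigenRatioLim bounds:

S = crossprod(matrix(rnorm(9), 3))                       # an example 3 x 3 covariance matrix
ev = eigen(S, symmetric = TRUE, only.values = TRUE)$values
max(ev) / min(ev)                                        # the kernel is removed if this exceeds eigenRatioLim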
Value
A list of size 5:
alpha: a numeric vector of size G, the mixture weights.

mu: a d x G matrix whose columns are the means of the Gaussian kernels.

sigma: a list of G d x d covariance matrices (friendlyOutput = TRUE), or a d^2 x G matrix whose columns are flattened covariance matrices (friendlyOutput = FALSE).

fitted: a numeric vector of size N, the densities of the mixture model at the N observations.

clusterMember: a list of G integer vectors; the g-th vector contains the indices of the observations assigned to the g-th kernel.
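As a hedged illustration (assuming friendlyOutput = TRUE and the mvtnorm package, which GMKMcharlie does not itself require), the fitted densities can be recomputed from the returned parameters:

mixDensityAt = function(x, fit)          # density of the fitted mixture at observation x
  sum(sapply(seq_along(fit$alpha), function(g)
    fit$alpha[g] * mvtnorm::dmvnorm(x, mean = fit$mu[, g], sigma = fit$sigma[[g]])))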
Warning
For one-dimensional data, X should still follow the data structure requirements: a matrix where each column is an observation.
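For example, a hypothetical one-dimensional sample of size 1000 would be supplied as a 1 x 1000 matrix:

X1d = matrix(rnorm(1000), nrow = 1)   # one observation per column
rst = GMKMcharlie::GM(X1d, G = 2L, maxCore = 1L, verbose = FALSE)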
Examples
# =============================================================================
# Examples below use 1 thread to pass CRAN check. Speed advantage of multiple
# threads will be more pronounced for larger data.
# =============================================================================
# =============================================================================
# Parameterize the iris data. Let the function initialize Gaussian kernels.
# =============================================================================
X = t(iris[1:4])
# CRAN check only allows 2 threads at most. Increase `maxCore` for
# acceleration.
gmmRst = GMKMcharlie::GM(X, G = 4L, maxCore = 1L, friendlyOutput = FALSE)
str(gmmRst)
# =============================================================================
# Parameterize the iris data given Gaussian kernels.
# =============================================================================
G = 3L
d = nrow(X) # Dimensionality.
alpha = rep(1, G) / G
mu = X[, sample(ncol(X), G)] # Sample observations as initial means.
# Take the average variance and create initial covariance matrices.
meanVarOfEachDim = sum(diag(var(t(X)))) / d
covar = diag(meanVarOfEachDim / G, d)
covars = matrix(rep(as.numeric(covar), G), nrow = d * d)
# Models are sensitive to initialization.
gmmRst2 = GMKMcharlie::GM(
X, alpha = alpha, mu = mu, sigma = covars, maxCore = 1L,
friendlyOutput = FALSE)
str(gmmRst2)
# =============================================================================
# For fun, fit Rosenbrock function with a Gaussian mixture.
# =============================================================================
set.seed(123)
rosenbrock <- function(x, y) {(1 - x) ^ 2 + 100 * (y - x ^ 2) ^ 2}
N = 2000L
x = runif(N, -2, 2)
y = runif(N, -1, 3)
z = rosenbrock(x, y)
X = rbind(x, y)
Xw = z * (N / sum(z)) # Weights on observations should sum up to N.
gmmFit = GMKMcharlie::GM(X, Xw = Xw, G = 5L, maxCore = 1L, verbose = FALSE,
friendlyOutput = FALSE)
oldpar = par()$mfrow
par(mfrow = c(1, 2))
plot3D::points3D(x, y, z, pch = 20)
plot3D::points3D(x, y, gmmFit$fitted, pch = 20)
par(mfrow = oldpar)
# =============================================================================
# For fun, fit a 3D spiral distribution.
# =============================================================================
N = 2000
t = runif(N) ^ 2 * 15
x = cos(t) + rnorm(N) * 0.1
y = sin(t) + rnorm(N) * 0.1
z = t + rnorm(N) * 0.1
X = rbind(x, y, z)
d = 3L
G = 10L
gmmFit = GMKMcharlie::GM(X, G = G, maxCore = 1L, verbose = FALSE,
KmeansFirst = TRUE, KmeansPPfirst = TRUE, KmeansRandomSeed = 42,
friendlyOutput = TRUE)
# Sample N points from the Gaussian mixture.
ns = as.integer(round(N * gmmFit$alpha))
sampledPoints = list()
for(i in 1:G)
{
sampledPoints[[i]] = MASS::mvrnorm(
ns[i], mu = gmmFit$mu[, i], Sigma = matrix(gmmFit$sigma[[i]], nrow = d))
}
sampledPoints =
matrix(unlist(lapply(sampledPoints, function(x) t(x))), nrow = d)
# Plot the original data and the samples from the mixture model.
oldpar = par()$mfrow
par(mfrow = c(1, 2))
plot3D::points3D(x, y, z, pch = 20)
plot3D::points3D(x = sampledPoints[1, ],
y = sampledPoints[2, ],
z = sampledPoints[3, ], pch = 20)
par(mfrow = oldpar)
# =============================================================================
# For fun, fit a 3D spiral distribution. Fix parameters at random.
# =============================================================================
N = 2000
t = runif(N) ^ 2 * 15
x = cos(t) + rnorm(N) * 0.1
y = sin(t) + rnorm(N) * 0.1
z = t + rnorm(N) * 0.1
X = rbind(x, y, z); dimnames(X) = NULL
d = 3L
G = 10L
mu = X[, sample(ncol(X), G)]
s = matrix(rep(as.numeric(cov(t(X))), G), ncol = G)
alpha = rep(1 / G, G)
updateAlpha = sample(c(TRUE, FALSE), G, replace = TRUE)
updateMean = sample(c(TRUE, FALSE), G, replace = TRUE)
updateSigma = sample(c(TRUE, FALSE), G, replace = TRUE)
gmmFit = GMKMcharlie::GM(X, alpha = alpha, mu = mu, sigma = s, G = G,
maxCore = 2, verbose = FALSE,
updateAlpha = updateAlpha,
updateMean = updateMean,
updateSigma = updateSigma,
convergenceEPS = 1e-5, alphaEPS = 1e-8,
friendlyOutput = TRUE)