KM {GMKMcharlie} | R Documentation |
K-means over dense representation of data
Description
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over dense representation of data.
Usage
KM(
X,
centroid,
Xw = rep(1, ncol(X)),
minkP = 2,
maxIter = 100L,
maxCore = 7L,
verbose = TRUE
)
Arguments
X |
A |
centroid |
A |
Xw |
A numeric vector of size |
minkP |
A numeric value or a character string. If numeric, |
maxIter |
An integer. The maximal number of iterations. Default 100. |
maxCore |
An integer. The maximal number of threads to invoke. No more than the total number of logical processors on machine. Default 7. |
verbose |
A boolean value. |
Details
Implementation highlights include:
(i) In Minkowski distance calculation, integer power no greater than 30 uses multiplications. Fractional powers or powers above 30 call std::pow()
.
(ii) Multithreaded observation-centroid distance calculations. Distances are memorized to avoid unnecessary recomputations if centroids did not change in the last iteration.
(iii) A lookup table is built for storing observation - centroid ID pairs during the assignment step. Observation IDs are then grouped by centroid IDs which allows parallel computing cluster means.
(iv) Function allows non-uniform weights on observations.
(v) Meta-template programming trims branches over different distance functions and other computing methods during compile time.
Value
A list of size K
, the number of clusters. Each element is a list of 3 vectors:
centroid |
a numeric vector of size |
clusterMember |
an integer vector of indexes of elements grouped to |
member2centroidDistance |
a numeric vector of the same size of |
Empty clusterMember
implies empty cluster.
Note
Although rarely happens, divergence of K-means with non-Euclidean distance minkP != 2
measure is still a theoretical possibility.
Examples
# ===========================================================================
# Play random numbers. See speed.
# ===========================================================================
N = 5000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
dat = matrix(rnorm(N * d) + runif(N * d), nrow = d)
# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
X = dat, K, firstSelection = 1L, minkP = 2, stochastic = FALSE,
seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)
centroid = dat[, centroidInd]
# Euclidean.
system.time({rst = GMKMcharlie::KM(
X = dat, centroid = centroid, maxIter = 100,
minkP = 2, maxCore = 2, verbose = TRUE)})
# Cosine dissimilarity.
dat = apply(dat, 2, function(x) x / sum(x ^ 2) ^ 0.5)
centroid = dat[, centroidInd]
system.time({rst2 = GMKMcharlie::KM(
X = dat, centroid = centroid, maxIter = 100,
minkP = "cosine", maxCore = 2, verbose = TRUE)})
# ===========================================================================
# Test against R's inbuilt km()
# ===========================================================================
dat = t(iris[1:4])
dimnames(dat) = NULL
# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
X = dat, K = 3, firstSelection = 1L, minkP = 2, stochastic = FALSE,
seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)
centroid = dat[, centroidInd]
rst = GMKMcharlie::KM(X = dat, centroid = centroid, maxIter = 100,
minkP = 2, maxCore = 2, verbose = TRUE)
rst = lapply(rst, function(x) sort(x$clusterMember))
rst2 = kmeans(x = t(dat), centers = t(centroid), algorithm = "Lloyd")
rst2 = aggregate(list(1L : length(rst2$cluster)),
list(rst2$cluster), function(x) sort(x))[[2]]
setdiff(rst, rst2)