KMconstrained {GMKMcharlie}    R Documentation
K-means over dense data input with constraints on cluster weights
Description
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over a dense representation of data, subject to cluster size (weight) constraints.
Usage
KMconstrained(
  X,
  centroid,
  Xw = rep(1, ncol(X)),
  clusterWeightUB = rep(ncol(X) + 1, ncol(centroid)),
  minkP = 2,
  convergenceTail = 5L,
  tailConvergedRelaErr = 1e-04,
  maxIter = 100L,
  maxCore = 7L,
  paraSortInplaceMerge = FALSE,
  verbose = TRUE
)
Arguments
X: A d x N numeric matrix, where N is the number of observations and d is the dimensionality. X[, i] is the i-th observation.

centroid: A d x K numeric matrix, where K is the number of clusters. centroid[, i] is the initial center of the i-th cluster.

Xw: A numeric vector of size N. Xw[i] is the weight on observation X[, i]. Default: uniform weights of 1 for all observations.

clusterWeightUB: An integer vector of size K. clusterWeightUB[i] is the upper bound on the total weight of the i-th cluster. If all observation weights in Xw are 1, clusterWeightUB upper-bounds the cluster sizes. Default: N + 1 for every cluster, i.e. no binding constraint.

minkP: A numeric value or a character string. If numeric, minkP is the power p in the Minkowski distance (sum(|x - y| ^ p)) ^ (1 / p). If a character string, "m" stands for the max-norm (Chebyshev) distance and "cosine" stands for cosine dissimilarity. Default 2, the Euclidean distance.

convergenceTail: An integer. The algorithm may end up with "cyclical convergence" due to the size / weight constraints, that is, every few iterations produce the same clustering. If the cost (total in-cluster distance) of each of the last convergenceTail iterations has a relative difference of less than tailConvergedRelaErr against the cost of its prior iteration, the program stops. See the sketch after this list.

tailConvergedRelaErr: A numeric value, the relative-error threshold explained in convergenceTail.

maxIter: An integer. The maximal number of iterations. Default 100.

maxCore: An integer. The maximal number of threads to invoke. It should be no more than the total number of logical processors on the machine. Default 7.

paraSortInplaceMerge: A boolean value. TRUE makes the parallel merge sort merge in place, which is slower but requires no extra memory. Default FALSE.

verbose: A boolean value. TRUE prints the iteration progress.
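A minimal sketch of the stopping rule described under convergenceTail, assuming costs holds the per-iteration total in-cluster distances; tailConverged is an illustrative helper, not a package function:

# Stop when each of the last `convergenceTail` costs differs from the cost
# of its prior iteration by a relative error below `tailConvergedRelaErr`.
tailConverged = function(costs, convergenceTail = 5L, tailConvergedRelaErr = 1e-4)
{
  n = length(costs)
  if (n <= convergenceTail) return(FALSE)
  recent = costs[(n - convergenceTail + 1):n]    # the last convergenceTail costs
  prior = costs[(n - convergenceTail):(n - 1)]   # each cost's prior iteration
  all(abs(recent - prior) / prior < tailConvergedRelaErr)
}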
Details
See details in KM() for common implementation highlights. Weight upper bounds are implemented as follows:

In each iteration, all (observation ID, centroid ID, distance) tuples are sorted by distance. From the first to the last tuple, the algorithm assigns the observation to the cluster labeled by the centroid ID if (i) the observation has not already been assigned and (ii) the cluster's weight has not yet reached its upper bound. The actual implementation differs slightly; a parallel merge sort is crafted for computing speed.
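For illustration, a single-threaded sketch of this greedy, capacity-constrained assignment pass; assignGreedy is a hypothetical helper that uses squared Euclidean distance and base R's order() in place of the package's parallel merge sort:

# X: d x N data; centroid: d x K centers; ub: per-cluster weight upper bounds.
assignGreedy = function(X, centroid, ub)
{
  N = ncol(X); K = ncol(centroid)
  D = matrix(0, N, K)   # D[i, k] = squared distance, observation i to centroid k
  for (k in 1:K) D[, k] = colSums((X - centroid[, k]) ^ 2)
  ord = order(D)        # all N * K tuples, sorted by distance ascending
  label = integer(N)    # 0 = unassigned
  load = numeric(K)     # current weight in each cluster
  for (t in ord)
  {
    i = (t - 1) %% N + 1    # observation ID of this tuple
    k = (t - 1) %/% N + 1   # centroid ID of this tuple
    if (label[i] == 0L && load[k] + 1 <= ub[k])
    {
      label[i] = k
      load[k] = load[k] + 1 # unit weights; would be Xw[i] in the weighted case
    }
  }
  label
}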
Value
A list of size K, the number of clusters. Each element is a list of 3 vectors:

centroid: a numeric vector of size d, the center of the cluster.

clusterMember: an integer vector of the indexes of the observations assigned to this centroid.

member2centroidDistance: a numeric vector of the same size as clusterMember. The i-th element is the distance (as selected by minkP) from the centroid to the clusterMember[i]-th observation in X.

An empty clusterMember implies an empty cluster.
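For convenience, the nested result can be flattened into one cluster label per observation; clusterLabels is an illustrative helper, not a package function:

# rst is the list returned by KMconstrained(); N is the number of observations.
clusterLabels = function(rst, N)
{
  label = integer(N)
  for (k in seq_along(rst)) label[rst[[k]]$clusterMember] = k
  label
}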
Note
Although it rarely happens, divergence of K-means under a non-Euclidean distance measure (minkP != 2) remains a theoretical possibility. Bounding the cluster weights / sizes increases the chance of divergence.
Examples
N = 3000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
dat = matrix(rnorm(N * d) + runif(N * d), nrow = d)
# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
X = dat, K, firstSelection = 1L, minkP = 2, stochastic = FALSE,
seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)
centroid = dat[, centroidInd]
# Each cluster size should not be greater than N / K * 2.
sizeConstraints = as.integer(rep(N / K * 2, K))
system.time({rst = GMKMcharlie::KMconstrained(
X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})
# Size upper bounds vary in [N / K * 1.5, N / K * 2].
sizeConstraints = as.integer(round(runif(K, N / K * 1.5, N / K * 2)))
system.time({rst = GMKMcharlie::KMconstrained(
X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})
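# Illustrative check (not from the package's examples): verify that every
# returned cluster size respects its upper bound. Weights are all 1 here,
# so cluster weight equals cluster size.
clusterSizes = sapply(rst, function(cl) length(cl$clusterMember))
all(clusterSizes <= sizeConstraints) # should be TRUE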