KMconstrainedSparse {GMKMcharlie} | R Documentation |
K-means over sparse data input with constraints on cluster weights
Description
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over a sparse representation of the data, subject to cluster size (weight) constraints.
Usage
KMconstrainedSparse(
X,
d,
centroid,
Xw = rep(1, length(X)),
clusterWeightUB = rep(length(X) + 1, length(centroid)),
minkP = 2,
convergenceTail = 5L,
tailConvergedRelaErr = 1e-04,
maxIter = 100L,
maxCore = 7L,
paraSortInplaceMerge = FALSE,
verbose = TRUE
)
Arguments
X |
A list of size N, the number of data points. Each element is one point in sparse form: a list of two vectors, the integer indices of the point's nonzero dimensions and the corresponding numeric values. See GMKMcharlie::d2s() and the Examples for the conversion from a dense matrix.
d |
An integer. The dimensionality of the data, i.e. the size of a point in dense form.
centroid |
A list of size K, the number of clusters. Each element is an initial centroid in the same sparse format as the elements of X.
Xw |
A numeric vector of size N. Xw[i] is the weight on the i-th data point. Default all 1s.
clusterWeightUB |
An integer vector of size K. clusterWeightUB[i] is the upper bound on the total weight of cluster i. The default, length(X) + 1 for every cluster, imposes no effective constraint.
minkP |
A numeric value or a character string. If numeric, minkP is the power p in the Minkowski distance; p = 2 (the default) yields Euclidean distance. If a character string, cosine dissimilarity is used, i.e. spherical K-means.
convergenceTail |
An integer. Because of the size / weight constraints, the algorithm may fall into "cyclical convergence", where every few iterations reproduce the same clustering. If the costs (total in-cluster distances) of the last convergenceTail iterations differ relatively by no more than tailConvergedRelaErr, the algorithm is deemed converged. Default 5.
tailConvergedRelaErr |
A numeric value; see convergenceTail. Default 1e-04.
maxIter |
An integer. The maximal number of iterations. Default 100.
maxCore |
An integer. The maximal number of threads to invoke. Should be no more than the number of logical processors on the machine. Default 7.
paraSortInplaceMerge |
A boolean value. TRUE lets the internal parallel sort merge in place, conserving memory at some cost in speed. Default FALSE.
verbose |
A boolean value. TRUE prints algorithm progress. Default TRUE.
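A minimal sketch of the sparse input format expected per element of X and centroid, mirroring the conversion shown in the Examples (the point values here are made up for illustration):

```r
# Hypothetical 6-dimensional point with nonzeros at dimensions 2 and 5.
x = c(0, 1.5, 0, 0, -0.7, 0)
nonz = which(x != 0)          # integer indices of the nonzero entries
sparseX = list(nonz, x[nonz]) # the two-vector sparse form: indices, values
```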
Details
See the Details sections of KMconstrained() and KM().
Value
A list of size K, the number of clusters. Each element is a list of 3 vectors:
centroid |
a numeric vector of size d, the dense coordinates of the cluster centroid.
clusterMember |
an integer vector of the indexes (into X) of the data points assigned to this centroid.
member2centroidDistance |
a numeric vector of the same size as clusterMember; its i-th element is the distance from the i-th member to centroid.
An empty clusterMember implies an empty cluster.
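As a sketch of consuming the return value, using a hand-built mock result with K = 2 (not actual package output; field names as documented above):

```r
# Mock result: a list of K = 2 clusters, each a list of 3 vectors.
rst = list(
  list(centroid = c(0.1, 0.2),
       clusterMember = c(1L, 3L),
       member2centroidDistance = c(0.05, 0.12)),
  list(centroid = c(0.9, 0.8),
       clusterMember = 2L,
       member2centroidDistance = 0.07))
# Cluster sizes; an empty clusterMember would give size 0 (empty cluster).
sizes = sapply(rst, function(cl) length(cl$clusterMember))
```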
Note
Although it rarely happens, divergence of K-means under a non-Euclidean distance measure (minkP != 2) remains a theoretical possibility. Bounding the cluster weights / sizes increases the chance of divergence.
Examples
N = 5000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
# Create a data matrix, about 95% of which are zeros.
dat = matrix(unlist(lapply(1L : N, function(x)
{
tmp = numeric(d)
# Nonzero entries.
Nnz = as.integer(max(1, d * runif(1, 0, 0.05)))
tmp[sample(d, Nnz)] = runif(Nnz) + rnorm(Nnz)
tmp
})), nrow = d); gc()
# Convert to sparse representation.
# GMKMcharlie::d2s() is equivalent.
sparsedat = apply(dat, 2, function(x)
{
nonz = which(x != 0)
list(nonz, x[nonz])
}); gc()
centroidInd = sample(length(sparsedat), K)
# Test speed using sparse representation.
sparseCentroid = sparsedat[centroidInd]
# Size upper bounds vary in [N / K * 1.5, N / K * 2]
sizeConstraints = as.integer(round(runif(K, N / K * 1.5, N / K * 2)))
system.time({sparseRst = GMKMcharlie::KMconstrainedSparse(
X = sparsedat, d = d, centroid = sparseCentroid,
clusterWeightUB = sizeConstraints,
tailConvergedRelaErr = 1e-6,
maxIter = 100, minkP = 2, maxCore = 2, verbose = TRUE)})
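An optional follow-up to the run above (assuming sparseRst and sizeConstraints from the example are still in scope): since Xw defaults to all 1s, each cluster's weight equals its member count and should respect the supplied upper bound.

```r
# Verify the size constraints hold in the returned clustering.
rstSizes = sapply(sparseRst, function(cl) length(cl$clusterMember))
stopifnot(all(rstSizes <= sizeConstraints))
```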