KMsparse {GMKMcharlie}    R Documentation
K-means over sparse representation of data
Description
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over sparse representation of data.
Usage
KMsparse(
X,
d,
centroid,
Xw = rep(1, length(X)),
minkP = 2,
maxIter = 100L,
maxCore = 7L,
verbose = TRUE
)
Arguments
X
A list of size N, the number of observations. Each element is itself a list of two vectors: the indexes of the observation's nonzero dimensions and the values on those dimensions (a small sketch of this format follows the argument list).
d
An integer. The dimensionality of the data, i.e. the length of each observation as a dense vector.
centroid
A list of size K, the number of clusters. Each element is an initial centroid in the same sparse format as the elements of X.
Xw
A numeric vector of size N, the weights on the observations. Default is a weight of 1 for every observation.
minkP
A numeric value or a character string. If numeric, minkP is the power p in the Minkowski distance (sum of |x_i - y_i|^p over all dimensions, then the 1/p-th power). A character string selects a non-Minkowski measure such as the maximum norm or the cosine dissimilarity used in spherical K-means. Default 2.
maxIter
An integer. The maximal number of iterations. Default 100.
maxCore
An integer. The maximal number of threads to invoke. It should be no more than the total number of logical processors on the machine. Default 7.
verbose
A boolean value. If TRUE, print iteration progress. Default TRUE.
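A minimal sketch of the sparse format on a made-up 5-dimensional observation; GMKMcharlie::d2s(), used in the Examples below, performs this conversion for a whole matrix:

x = c(0, 1.5, 0, 0, -0.7)                 # dense observation, d = 5
sparseX = list(which(x != 0), x[x != 0])  # indexes 2, 5 and values 1.5, -0.7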
Details
See details in KM() for implementation highlights. An additional optimization: except for the maximum norm, the cost of computing the distance between a dense centroid vector and a sparse observation is linear in the number of the observation's nonzero entries, which is usually far smaller than the size of the dense vector. This is done by letting every centroid memorize its before-root Minkowski norm; the full distance is then recovered by adding the residual norm to the partial distance computed over the observation's nonzero dimensions.
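The following sketch spells this trick out in plain R for a single centroid and a single sparse observation given as an index vector and a value vector. It uses illustrative names and is not the package's internal C++ code; in the package the before-root norm is memorized per centroid rather than recomputed.

# Before-root Minkowski norm of a dense centroid: sum of |c_j|^p over all dimensions.
preRootNorm = function(centroid, p) sum(abs(centroid) ^ p)

# Distance between a dense centroid and a sparse observation (nonzero indexes
# `ind`, values `val`), touching only the observation's nonzero dimensions.
sparseMinkDist = function(centroid, ind, val, p = 2,
                          cNorm = preRootNorm(centroid, p))
{
  # Dimensions where the observation is nonzero.
  partial = sum(abs(val - centroid[ind]) ^ p)
  # Centroid mass on the remaining dimensions, where the observation is zero.
  residual = cNorm - sum(abs(centroid[ind]) ^ p)
  (partial + residual) ^ (1 / p)
}

# Agrees with the plain dense computation:
# cen = rnorm(10); x = numeric(10); x[c(2L, 7L)] = c(0.5, -1.2)
# all.equal(sparseMinkDist(cen, c(2L, 7L), c(0.5, -1.2)),
#           sum(abs(x - cen) ^ 2) ^ 0.5)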
Value
A list of size K, the number of clusters. Each element is a list of 3 vectors:
centroid
a numeric vector of size d, the dense coordinates of the cluster centroid.
clusterMember
an integer vector of the indexes of the observations grouped to this centroid.
member2centroidDistance
a numeric vector of the same size as clusterMember, giving the distance from each member observation to the centroid.
An empty clusterMember implies an empty cluster.
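A short usage sketch, under the assumption that the three components are retrievable by the names listed above and that sparseRst is the object produced by the sparse run in the Examples section:

# Cluster sizes and mean member-to-centroid distances (NaN for empty clusters).
clusterSizes = sapply(sparseRst, function(cl) length(cl$clusterMember))
meanDist = sapply(sparseRst, function(cl) mean(cl$member2centroidDistance))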
Note
Although it rarely happens, divergence of K-means under a non-Euclidean distance measure (minkP != 2) remains a theoretical possibility.
Examples
# ===========================================================================
# Play random numbers. See speed.
# ===========================================================================
N = 10000L # Number of points.
d = 500L # Dimensionality.
K = 100L # Number of clusters.
# Create a data matrix, about 95% of which are zeros.
dat = matrix(unlist(lapply(1L : N, function(x)
{
  tmp = numeric(d)
  # Nonzero entries.
  Nnz = as.integer(max(1, d * runif(1, 0, 0.05)))
  tmp[sample(d, Nnz)] = runif(Nnz) + rnorm(Nnz)
  tmp
})), nrow = d); gc()
# Convert to sparse representation.
# GMKMcharlie::d2s() achieves the same.
sparsedat = apply(dat, 2, function(x)
{
  nonz = which(x != 0)
  list(nonz, x[nonz])
}); gc()
centroidInd = sample(length(sparsedat), K)
# Test speed using dense representation.
centroid = dat[, centroidInd]
system.time({rst = GMKMcharlie::KM(
X = dat, centroid = centroid, maxIter = 100,
minkP = 2, maxCore = 2, verbose = TRUE)})
# Test speed using sparse representation.
sparseCentroid = sparsedat[centroidInd]
system.time({sparseRst = GMKMcharlie::KMsparse(
X = sparsedat, d = d, centroid = sparseCentroid,
maxIter = 100, minkP = 2, maxCore = 2, verbose = TRUE)})
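# ===========================================================================
# Sanity check (illustrative addition, not part of the original example).
# With identical initial centroids and minkP = 2, the dense and sparse runs
# implement the same algorithm, so the recovered centroids should agree up to
# floating-point and tie-breaking differences. The component name "centroid"
# is assumed from the Value section.
# ===========================================================================
range(sapply(rst, function(cl) cl$centroid) -
      sapply(sparseRst, function(cl) cl$centroid))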