KMppIniSparse {GMKMcharlie} | R Documentation |
Minkowski and spherical, deterministic and stochastic, multithreaded K-means++ initialization over sparse representation of data
Description
Find suitable observations as initial centroids.
Usage
KMppIniSparse(
  X,
  d,
  K,
  firstSelection = 1L,
  minkP = 2,
  stochastic = FALSE,
  seed = 123,
  maxCore = 7L,
  verbose = TRUE
  )
Arguments
X |
A list of size |
d |
An integer. The dimensionality of |
K |
An integer, the number of centroids. |
firstSelection |
An integer, index of the observation selected as the first initial centroid in |
minkP |
A numeric value or a character string. If numeric, |
stochastic |
A boolean value. |
seed |
Random seed if |
maxCore |
An integer. The maximum number of threads to invoke. It should not exceed the total number of logical processors on the machine. Default 7. |
verbose |
A boolean value. |
Details
In each iteration, the distances between the newly selected centroid and all the other observations are computed using multiple threads. Thread scheduling is custom-built to minimize the overhead of thread communication.
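The iteration described above can be sketched in plain R. This is a minimal single-threaded illustration on a dense matrix whose columns are observations, not the package's implementation. The function name `kmppSketch` is hypothetical, and the two selection rules are assumptions based on the usual K-means++ scheme: the stochastic variant samples the next centroid with probability proportional to the squared minimal distance, and the deterministic variant takes the observation farthest from its nearest centroid.

```r
# Hypothetical single-threaded sketch of K-means++ seeding on a dense matrix.
# Columns of X are observations. Not the GMKMcharlie implementation.
kmppSketch <- function(X, K, firstSelection = 1L, minkP = 2,
                       stochastic = FALSE, seed = 123) {
  N <- ncol(X)
  chosen <- integer(K)
  chosen[1] <- firstSelection
  # minDist[i]: distance from observation i to its nearest chosen centroid.
  minDist <- rep(Inf, N)
  # Note: this resets the global RNG state, which is acceptable for a sketch.
  if (stochastic) set.seed(seed)
  for (k in seq_len(K - 1L)) {
    newest <- X[, chosen[k]]
    # Minkowski distance from the newest centroid to every observation.
    # `X - newest` recycles `newest` down each column of X.
    dist <- colSums(abs(X - newest)^minkP)^(1 / minkP)
    minDist <- pmin(minDist, dist)
    if (stochastic) {
      # Assumed rule: sample proportionally to the squared minimal distance.
      chosen[k + 1L] <- sample.int(N, 1L, prob = minDist^2)
    } else {
      # Assumed rule: pick the observation farthest from its nearest centroid.
      chosen[k + 1L] <- which.max(minDist)
    }
  }
  chosen
}

# Usage on small random data: 4 dimensions, 50 observations.
set.seed(42)
Xsmall <- matrix(rnorm(4 * 50), nrow = 4)
idxDet <- kmppSketch(Xsmall, K = 5L)
idxSto <- kmppSketch(Xsmall, K = 5L, stochastic = TRUE, seed = 123)
```

Only the distance computation against the newest centroid is repeated per iteration; the running minimum `minDist` carries the information from earlier centroids, which is the step the package parallelizes across threads.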
Value
An integer vector of size K. The vector contains the indexes of observations selected as the initial centroids.
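As a small illustration of how such an index vector might be used, the sketch below pulls the selected observations out of a dense matrix whose columns are observations, as in the Examples. The vector `idx` is a hypothetical stand-in for a returned value, not actual output of `KMppIniSparse`.

```r
# 5 dimensions, 8 observations stored as columns.
set.seed(1)
Xdense <- matrix(rnorm(5 * 8), nrow = 5)

# Hypothetical stand-in for the returned index vector (K = 3).
idx <- c(1L, 4L, 7L)

# Each column of `centroids` is one selected observation.
# drop = FALSE keeps the matrix shape even when K = 1.
centroids <- Xdense[, idx, drop = FALSE]
dim(centroids)
```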
Examples
N = 2000L
d = 3000L
X = matrix(rnorm(N * d) + 2, nrow = d)
# Set 95% to 99% of the entries in each column of X to zero:
X = apply(X, 2, function(x) {
  x[sort(sample(d, d * runif(1, 0.95, 0.99)))] = 0; x})
# Get the sparse version of X.
sparseX = GMKMcharlie::d2s(X)
K = 30L
seed = 123L
# Time cost of finding the centroids via dense representation.
# CRAN check allows only 2 threads. Increase `maxCore` for more speed.
system.time({kmppViaDense = GMKMcharlie::KMppIni(
  X, K, firstSelection = 1L, minkP = 2, stochastic = TRUE, seed = seed,
  maxCore = 2L)})
# Time cost of finding the initial centroids via sparse representation.
system.time({kmppViaSparse = GMKMcharlie::KMppIniSparse(
  sparseX, d, K, firstSelection = 1L, minkP = 2, stochastic = TRUE,
  seed = seed, maxCore = 2L)})
# Results should be identical, so the sum of absolute differences should be 0.
sum(abs(kmppViaSparse - kmppViaDense))