KMconstrainedSparse {GMKMcharlie} | R Documentation |
K-means over sparse data input with constraints on cluster weights
Description
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over a sparse representation of the data, subject to cluster size (weight) constraints.
Usage
KMconstrainedSparse(
X,
d,
centroid,
Xw = rep(1, length(X)),
clusterWeightUB = rep(length(X) + 1, length(centroid)),
minkP = 2,
convergenceTail = 5L,
tailConvergedRelaErr = 1e-04,
maxIter = 100L,
maxCore = 7L,
paraSortInplaceMerge = FALSE,
verbose = TRUE
)
Arguments
X |
A list of size N, the number of data points. Each element is one point in sparse form: a list of two vectors, the integer indices of the point's nonzero dimensions and the corresponding numeric values. See GMKMcharlie::d2s() and the Examples for the conversion from a dense matrix.
d |
An integer. The dimensionality of the data, i.e. the size of a point in dense form.
centroid |
A list of size K, the number of clusters. Each element is an initial centroid in the same sparse format as the elements of X.
Xw |
A numeric vector of size N. Xw[i] is the weight on the i-th data point. Default all 1s.
clusterWeightUB |
An integer vector of size K. clusterWeightUB[i] is the upper bound on the total weight of cluster i. The default, length(X) + 1 for every cluster, imposes no effective constraint.
minkP |
A numeric value or a character string. If numeric, minkP is the power p in the Minkowski distance; p = 2 (the default) yields Euclidean distance. If a character string, cosine dissimilarity is used, i.e. spherical K-means.
convergenceTail |
An integer. Because of the size / weight constraints, the algorithm may fall into "cyclical convergence", where every few iterations reproduce the same clustering. If the costs (total in-cluster distances) of the last convergenceTail iterations differ relatively by no more than tailConvergedRelaErr, the algorithm is deemed converged. Default 5.
tailConvergedRelaErr |
A numeric value; see convergenceTail. Default 1e-04.
maxIter |
An integer. The maximal number of iterations. Default 100.
maxCore |
An integer. The maximal number of threads to invoke. Should be no more than the number of logical processors on the machine. Default 7.
paraSortInplaceMerge |
A boolean value. TRUE lets the internal parallel sort merge in place, conserving memory at some cost in speed. Default FALSE.
verbose |
A boolean value. TRUE prints algorithm progress. Default TRUE.
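A minimal sketch of the sparse input format expected per element of X and centroid, mirroring the conversion shown in the Examples (the point values here are made up for illustration):

```r
# Hypothetical 6-dimensional point with nonzeros at dimensions 2 and 5.
x = c(0, 1.5, 0, 0, -0.7, 0)
nonz = which(x != 0)          # integer indices of the nonzero entries
sparseX = list(nonz, x[nonz]) # the two-vector sparse form: indices, values
```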
Details
See the Details sections of KMconstrained() and KM().
Value
A list of size K, the number of clusters. Each element is a list of 3 vectors:
centroid |
a numeric vector of size d, the dense coordinates of the cluster centroid.
clusterMember |
an integer vector of the indexes (into X) of the data points assigned to this centroid.
member2centroidDistance |
a numeric vector of the same size as clusterMember; its i-th element is the distance from the i-th member to centroid.
An empty clusterMember implies an empty cluster.
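As a sketch of consuming the return value, using a hand-built mock result with K = 2 (not actual package output; field names as documented above):

```r
# Mock result: a list of K = 2 clusters, each a list of 3 vectors.
rst = list(
  list(centroid = c(0.1, 0.2),
       clusterMember = c(1L, 3L),
       member2centroidDistance = c(0.05, 0.12)),
  list(centroid = c(0.9, 0.8),
       clusterMember = 2L,
       member2centroidDistance = 0.07))
# Cluster sizes; an empty clusterMember would give size 0 (empty cluster).
sizes = sapply(rst, function(cl) length(cl$clusterMember))
```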
Note
Although it rarely happens, divergence of K-means under a non-Euclidean distance measure (minkP != 2) remains a theoretical possibility. Bounding the cluster weights / sizes increases the chance of divergence.
Examples
N = 5000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
# Create a data matrix, about 95% of which are zeros.
dat = matrix(unlist(lapply(1L : N, function(x)
{
tmp = numeric(d)
# Nonzero entries.
Nnz = as.integer(max(1, d * runif(1, 0, 0.05)))
tmp[sample(d, Nnz)] = runif(Nnz) + rnorm(Nnz)
tmp
})), nrow = d); gc()
# Convert to sparse representation.
# GMKMcharlie::d2s() is equivalent.
sparsedat = apply(dat, 2, function(x)
{
nonz = which(x != 0)
list(nonz, x[nonz])
}); gc()
centroidInd = sample(length(sparsedat), K)
# Test speed using sparse representation.
sparseCentroid = sparsedat[centroidInd]
# Size upper bounds vary in [N / K * 1.5, N / K * 2]
sizeConstraints = as.integer(round(runif(K, N / K * 1.5, N / K * 2)))
system.time({sparseRst = GMKMcharlie::KMconstrainedSparse(
X = sparsedat, d = d, centroid = sparseCentroid,
clusterWeightUB = sizeConstraints,
tailConvergedRelaErr = 1e-6,
maxIter = 100, minkP = 2, maxCore = 2, verbose = TRUE)})
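An optional follow-up to the run above (assuming sparseRst and sizeConstraints from the example are still in scope): since Xw defaults to all 1s, each cluster's weight equals its member count and should respect the supplied upper bound.

```r
# Verify the size constraints hold in the returned clustering.
rstSizes = sapply(sparseRst, function(cl) length(cl$clusterMember))
stopifnot(all(rstSizes <= sizeConstraints))
```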