KMconstrained {GMKMcharlie}    R Documentation
K-means over dense data input with constraints on cluster weights
Description
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over a dense representation of data, subject to cluster size (weight) constraints.
Usage
KMconstrained(
  X,
  centroid,
  Xw = rep(1, ncol(X)),
  clusterWeightUB = rep(ncol(X) + 1, ncol(centroid)),
  minkP = 2,
  convergenceTail = 5L,
  tailConvergedRelaErr = 1e-04,
  maxIter = 100L,
  maxCore = 7L,
  paraSortInplaceMerge = FALSE,
  verbose = TRUE
)
Arguments
X: A d x N numeric matrix, where N is the number of observations and d is the dimensionality. X[, i] is the i-th observation.

centroid: A d x K numeric matrix, where K is the number of clusters. centroid[, i] is the initial center of the i-th cluster.

Xw: A numeric vector of size N. Xw[i] is the weight on observation X[, i]. Default: uniform weights of 1 for all observations.

clusterWeightUB: An integer vector of size K. clusterWeightUB[i] is the upper bound on the total weight of the i-th cluster. If all observation weights in Xw are 1, clusterWeightUB upper-bounds the cluster sizes. Default: N + 1 for every cluster, i.e. no binding constraint.

minkP: A numeric value or a character string. If numeric, minkP is the power p in the Minkowski distance (sum(|x - y| ^ p)) ^ (1 / p). If a character string, "m" stands for the max-norm (Chebyshev) distance and "cosine" stands for cosine dissimilarity. Default 2, the Euclidean distance.

convergenceTail: An integer. The algorithm may end up with "cyclical convergence" due to the size / weight constraints, that is, every few iterations produce the same clustering. If the cost (total in-cluster distance) of each of the last convergenceTail iterations has a relative difference of less than tailConvergedRelaErr against the cost of its prior iteration, the program stops. See the sketch after this list.

tailConvergedRelaErr: A numeric value, the relative-error threshold explained in convergenceTail.

maxIter: An integer. The maximal number of iterations. Default 100.

maxCore: An integer. The maximal number of threads to invoke. It should be no more than the total number of logical processors on the machine. Default 7.

paraSortInplaceMerge: A boolean value. TRUE makes the parallel merge sort merge in place, which is slower but requires no extra memory. Default FALSE.

verbose: A boolean value. TRUE prints the iteration progress.
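A minimal sketch of the stopping rule described under convergenceTail, assuming costs holds the per-iteration total in-cluster distances; tailConverged is an illustrative helper, not a package function:

# Stop when each of the last `convergenceTail` costs differs from the cost
# of its prior iteration by a relative error below `tailConvergedRelaErr`.
tailConverged = function(costs, convergenceTail = 5L, tailConvergedRelaErr = 1e-4)
{
  n = length(costs)
  if (n <= convergenceTail) return(FALSE)
  recent = costs[(n - convergenceTail + 1):n]    # the last convergenceTail costs
  prior = costs[(n - convergenceTail):(n - 1)]   # each cost's prior iteration
  all(abs(recent - prior) / prior < tailConvergedRelaErr)
}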
Details
See details in KM() for common implementation highlights. Weight upper bounds are implemented as follows:

In each iteration, all (observation ID, centroid ID, distance) tuples are sorted by distance. From the first to the last tuple, the algorithm assigns the observation to the cluster labeled by the centroid ID if (i) the observation has not already been assigned and (ii) the cluster's weight has not yet reached its upper bound. The actual implementation differs slightly; a parallel merge sort is crafted for computing speed.
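For illustration, a single-threaded sketch of this greedy, capacity-constrained assignment pass; assignGreedy is a hypothetical helper that uses squared Euclidean distance and base R's order() in place of the package's parallel merge sort:

# X: d x N data; centroid: d x K centers; ub: per-cluster weight upper bounds.
assignGreedy = function(X, centroid, ub)
{
  N = ncol(X); K = ncol(centroid)
  D = matrix(0, N, K)   # D[i, k] = squared distance, observation i to centroid k
  for (k in 1:K) D[, k] = colSums((X - centroid[, k]) ^ 2)
  ord = order(D)        # all N * K tuples, sorted by distance ascending
  label = integer(N)    # 0 = unassigned
  load = numeric(K)     # current weight in each cluster
  for (t in ord)
  {
    i = (t - 1) %% N + 1    # observation ID of this tuple
    k = (t - 1) %/% N + 1   # centroid ID of this tuple
    if (label[i] == 0L && load[k] + 1 <= ub[k])
    {
      label[i] = k
      load[k] = load[k] + 1 # unit weights; would be Xw[i] in the weighted case
    }
  }
  label
}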
Value
A list of size K, the number of clusters. Each element is a list of 3 vectors:

centroid: a numeric vector of size d, the center of the cluster.

clusterMember: an integer vector of the indexes of the observations assigned to this centroid.

member2centroidDistance: a numeric vector of the same size as clusterMember. The i-th element is the distance (as selected by minkP) from the centroid to the clusterMember[i]-th observation in X.

An empty clusterMember implies an empty cluster.
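For convenience, the nested result can be flattened into one cluster label per observation; clusterLabels is an illustrative helper, not a package function:

# rst is the list returned by KMconstrained(); N is the number of observations.
clusterLabels = function(rst, N)
{
  label = integer(N)
  for (k in seq_along(rst)) label[rst[[k]]$clusterMember] = k
  label
}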
Note
Although it rarely happens, divergence of K-means under a non-Euclidean distance measure (minkP != 2) remains a theoretical possibility. Bounding the cluster weights / sizes increases the chance of divergence.
Examples
N = 3000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
dat = matrix(rnorm(N * d) + runif(N * d), nrow = d)
# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
X = dat, K, firstSelection = 1L, minkP = 2, stochastic = FALSE,
seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)
centroid = dat[, centroidInd]
# Each cluster size should not be greater than N / K * 2.
sizeConstraints = as.integer(rep(N / K * 2, K))
system.time({rst = GMKMcharlie::KMconstrained(
X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})
# Size upper bounds vary in [N / K * 1.5, N / K * 2].
sizeConstraints = as.integer(round(runif(K, N / K * 1.5, N / K * 2)))
system.time({rst = GMKMcharlie::KMconstrained(
X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})
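# Illustrative check (not from the package's examples): verify that every
# returned cluster size respects its upper bound. Weights are all 1 here,
# so cluster weight equals cluster size.
clusterSizes = sapply(rst, function(cl) length(cl$clusterMember))
all(clusterSizes <= sizeConstraints) # should be TRUE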