bigkmeans {biganalytics}    R Documentation
Memory-efficient k-means cluster analysis
Description
k-means cluster analysis without the memory overhead, and possibly in parallel using shared memory.
Usage
bigkmeans(x, centers, iter.max = 10, nstart = 1, dist = "euclid")
Arguments
x
    a big.matrix object or a regular R matrix: the data to be clustered.
centers
    a scalar denoting the number of clusters, or for k clusters,
    a k by ncol(x) matrix of starting cluster centers.
iter.max
    the maximum number of iterations.
nstart
    number of random starts, to be done in parallel if there is a registered backend (see below).
dist
    the distance function. Can be "euclid" or "cosine".
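As an illustrative sketch of a typical call (the data values, dimensions, and cluster count below are invented for the example):

```r
library(bigmemory)
library(biganalytics)

# Simulated data: two well-separated groups of points in two columns.
x <- big.matrix(200, 2, type = "double")
x[1:100, ]   <- matrix(rnorm(200, mean = 0), 100, 2)
x[101:200, ] <- matrix(rnorm(200, mean = 5), 100, 2)

# Cluster into k = 2 groups, at most 10 iterations, one random start.
ans <- bigkmeans(x, centers = 2, iter.max = 10, nstart = 1)
table(ans$cluster)   # sizes of the two recovered clusters
```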
Details
The real benefit is the lack of memory overhead compared to the 
standard kmeans function.  Part of the overhead from 
kmeans() stems from the way it looks for unique starting 
centers, and could be improved upon.  The bigkmeans() function 
works on either regular R matrix objects, or on big.matrix 
objects.  In either case, it requires no extra memory beyond the data 
itself (other than a vector recording the cluster memberships), whereas 
kmeans() makes at least two extra copies of the data, and even more when 
multiple starts (nstart > 1) are used.  If nstart > 1 
and you are using bigkmeans() in parallel, a vector of cluster 
memberships must be stored for each worker, which could be 
memory-intensive for large data.  This isn't a problem if you run 
the multiple starts sequentially.
Unless you have a really big data set (where a single run of 
kmeans not only burns memory but takes more than a few 
seconds), use of parallel computing for multiple random starts is unlikely 
to be much faster than running them sequentially.
Only the algorithm by MacQueen is used here.
Value
An object of class kmeans, just as produced by 
kmeans.
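Because the return value has class kmeans, the usual components can be inspected; continuing a hypothetical fit ans from a previous bigkmeans() call:

```r
ans$cluster    # integer vector of cluster memberships, one per row of x
ans$centers    # k by ncol(x) matrix of final cluster centers
ans$withinss   # within-cluster sums of squares, one per cluster
ans$size       # number of points assigned to each cluster
```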
Note
A comment should be made about the excellent foreach package.  By 
default, it provides foreach(), which is used 
much like a for loop, here over the nstart random starting 
points (with a final comparison of all results to select the best).
When a parallel backend has been registered (see packages doSNOW, 
doMC, and doMPI, for example), bigkmeans() automatically 
distributes the nstart random starting points across the available 
workers.  This is done in shared memory on an SMP, but is distributed on 
a cluster *IF* the big.matrix is file-backed.  If used on a cluster 
with an in-RAM big.matrix, it will fail horribly.  We're considering 
an extra option as an alternative to the current behavior.
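A sketch of a parallel run on an SMP, assuming the doMC backend is installed (doSNOW or doMPI would be registered analogously); the filenames and sizes here are made up:

```r
library(bigmemory)
library(biganalytics)
library(doMC)

registerDoMC(cores = 4)   # register a multicore backend

# A file-backed big.matrix is also safe on a distributed cluster backend.
x <- filebacked.big.matrix(1000, 3, type = "double",
                           backingfile = "kmdata.bin",
                           descriptorfile = "kmdata.desc")
x[, ] <- rnorm(3000)

# The 8 random starts are distributed across the registered workers;
# the best result is returned.
ans <- bigkmeans(x, centers = 3, nstart = 8)
```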