R: Parallel Model-Based Clustering

pmclust-package {pmclust}

R Documentation

Parallel Model-Based Clustering

Description

The pmclust aims to utilize model-based clustering (unsupervised) for high dimensional and ultra large data, especially in a distributed manner. The package employs pbdMPI to perform a parallel version of expectation and maximization (EM) algorithm for finite mixture Gaussian models. The unstructured dispersion matrices are assumed in the Gaussian models. The implementation is default in the single program multiple data (SPMD) programming model. The code can be executed through pbdMPI and independent to most MPI applications. See the High Performance Statistical Computing (HPSC) website for more information, documents and examples.

Details

Package:	pmclust
Type:	Package
License:	GPL
LazyLoad:	yes

The main function is pmclust implementing the parallel EM algorithm for mixture multivariate Gaussian models with unstructured dispersions. This function groups a data matrix X.gbd or X.spmd into K clusters where X.gbd or X.spmd is potentially huge and taken from the global environment .GlobalEnv or .pmclustEnv.

Other main functions em.step, aecm.step, apecm.step, and apecma.step may provide better performance than the em.step in terms of computing time and convergent iterations.

kmeans.step provides the fastest clustering among above algorithms, but it is restricted by Euclidean distance and spherical dispersions.

Author(s)

Wei-Chen Chen wccsnow@gmail.com and George Ostrouchov

References

Programming with Big Data in R Website: https://pbdr.org/

Chen, W.-C. and Maitra, R. (2011) “Model-based clustering of regression time series data via APECM – an AECM algorithm sung to an even faster beat”, Statistical Analysis and Data Mining, 4, 567-578.

Chen, W.-C., Ostrouchov, G., Pugmire, D., Prabhat, M., and Wehner, M. (2013) “A Parallel EM Algorithm for Model-Based Clustering with Application to Explore Large Spatio-Temporal Data”, Technometrics, (revision).

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society Series B, 39, 1-38.

Lloyd., S. P. (1982) “Least squares quantization in PCM”, IEEE Transactions on Information Theory, 28, 129-137.

Meng, X.-L. and Van Dyk, D. (1997) “The EM Algorithm – an Old Folk-song Sung to a Fast New Tune”, Journal of the Royal Statistical Society Series B, 59, 511-567.

Examples

## Not run: 
### Under command mode, run the demo with 2 processors by
### (Use Rscript.exe for windows system)
mpiexec -np 2 Rscript -e 'demo(gbd_em,"pmclust",ask=F,echo=F)'
mpiexec -np 2 Rscript -e 'demo(gbd_aecm,"pmclust",ask=F,echo=F)'
mpiexec -np 2 Rscript -e 'demo(gbd_apecm,"pmclust",ask=F,echo=F)'
mpiexec -np 2 Rscript -e 'demo(gbd_apecma,"pmclust",ask=F,echo=F)'
mpiexec -np 2 Rscript -e 'demo(gbd_kmeans,"pmclust",ask=F,echo=F)'

mpiexec -np 2 Rscript -e 'demo(ex_em,"pmclust",ask=F,echo=F)'
mpiexec -np 2 Rscript -e 'demo(ex_aecm,"pmclust",ask=F,echo=F)'
mpiexec -np 2 Rscript -e 'demo(ex_apecm,"pmclust",ask=F,echo=F)'
mpiexec -np 2 Rscript -e 'demo(ex_apecma,"pmclust",ask=F,echo=F)'
mpiexec -np 2 Rscript -e 'demo(ex_kmeans,"pmclust",ask=F,echo=F)'

## End(Not run)