DCEM {DCEM} | R Documentation |
DCEM: Clustering Big Data using Expectation Maximization Star (EM*) Algorithm.
Description
Implements the EM* and EM algorithms for clustering univariate and multivariate Gaussian mixture data.
Demonstration and Testing
Cleaning the data:
The data should be cleaned before clustering by removing redundant columns, for example columns containing labels or constant entries (such as a column of all 0's or all 1's). See trim_data for details on cleaning the data. Refer to dcem_test for more details.
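As a rough illustration of the kind of cleaning described above, the following base R sketch drops a label column and any constant columns from a data frame. It does not call trim_data; the helper name drop_redundant_columns and its label_cols argument are hypothetical and exist only for this example.

```r
# Hypothetical helper (not part of DCEM): remove label columns and
# constant columns (e.g., all 0's or all 1's) from a data frame.
drop_redundant_columns <- function(data, label_cols = character(0)) {
  # Drop the columns the caller identifies as labels.
  data <- data[, !(names(data) %in% label_cols), drop = FALSE]
  # A column is constant when it has at most one distinct value.
  is_constant <- vapply(data, function(col) length(unique(col)) <= 1, logical(1))
  data[, !is_constant, drop = FALSE]
}

# Example: the built-in iris data plus an artificial all-zero column.
raw <- cbind(iris, zeros = 0)
cleaned <- drop_redundant_columns(raw, label_cols = "Species")
names(cleaned)
```

The cleaned data frame keeps only the informative numeric columns and could then be passed to a clustering routine.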
Understanding the output of dcem_test
The function dcem_test() returns a list of objects. This list contains the parameters associated with the Gaussian(s): posterior probabilities (prob), means (meu), covariance or standard deviation (sigma), priors (prior), and the cluster membership of the data (membership).
Note: The routine dcem_test() is for demonstration purposes only.
The function dcem_test calls the main routine dcem_train. See dcem_train for further details.
How to run on your dataset
See dcem_train and dcem_star_train for examples.
Package organization
The package is organized as a set of preprocessing functions and the core clustering modules. These functions are briefly described below.
- trim_data: Removes columns from the dataset. The user should clean the dataset before calling the dcem_train routine. Users can also clean the dataset themselves (without using trim_data) and then pass it to the dcem_train function.
- dcem_star_train and dcem_train: These are the primary interfaces to the EM* and EM algorithms, respectively. These functions accept the cleaned dataset and other parameters (number of iterations, convergence threshold, etc.) and run the algorithm until one of the following holds:
  - The number of iterations is reached.
  - Convergence is achieved.
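The two stopping rules above can be sketched with a plain EM loop for a univariate Gaussian mixture. This is a minimal illustration in base R, not the package's implementation and not EM*. The function name em_univariate is hypothetical, and measuring convergence as the largest change in the means between iterations is an assumption made for this example.

```r
# Illustrative only: plain EM for a univariate Gaussian mixture,
# written to show the two stopping rules. Not DCEM's implementation.
em_univariate <- function(x, meu, iteration_count = 200, threshold = 1e-5) {
  k <- length(meu)
  sigma <- rep(sd(x), k)   # start every component with the overall spread
  prior <- rep(1 / k, k)   # start with equal mixing weights
  converged <- FALSE
  iterations <- 0
  for (iter in seq_len(iteration_count)) {
    iterations <- iter
    # E-step: posterior probability of each component for each point.
    dens <- sapply(seq_len(k), function(j) prior[j] * dnorm(x, meu[j], sigma[j]))
    prob <- dens / rowSums(dens)
    # M-step: re-estimate priors, means and standard deviations.
    n_k <- colSums(prob)
    new_meu <- colSums(prob * x) / n_k
    sigma <- sqrt(colSums(prob * outer(x, new_meu, "-")^2) / n_k)
    prior <- n_k / length(x)
    # Stop early when the means moved less than the threshold (assumed criterion).
    shift <- max(abs(new_meu - meu))
    meu <- new_meu
    if (shift < threshold) {
      converged <- TRUE
      break
    }
  }
  # If the loop ends without convergence, the iteration limit was reached.
  list(meu = meu, sigma = sigma, prior = prior,
       iterations = iterations, converged = converged)
}

# Simulated data: two well-separated Gaussians.
set.seed(1)
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 6, sd = 1))
fit <- em_univariate(x, meu = c(1, 4))
capped <- em_univariate(x, meu = c(1, 4), iteration_count = 1)
```

Here fit stops when the means stabilize, while capped stops because its single allowed iteration is used up; the converged flag in the returned list distinguishes the two cases.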
DCEM supports the following initialization schemes:
- Random Initialization: Initializes the means randomly. Refer to meu_uv and meu_mv for initialization on univariate and multivariate data, respectively.
- Improved Initialization: Based on the K-means++ idea published in K-means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf. See meu_uv_impr and meu_mv_impr for details.
The choice of initialization scheme can be specified with the seeding parameter during training. See dcem_train for further details.
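To make the improved initialization more concrete, here is a minimal base R sketch of K-means++ style seeding as described by Arthur and Vassilvitskii: the first mean is a uniformly chosen data point, and each later mean is a data point sampled with probability proportional to its squared distance from the nearest mean chosen so far. The helper name kmeanspp_seed is hypothetical, and this is not the code behind meu_uv_impr or meu_mv_impr.

```r
# Hypothetical helper (not DCEM's meu_mv_impr): K-means++ style seeding.
kmeanspp_seed <- function(data, num_clusters) {
  data <- as.matrix(data)
  n <- nrow(data)
  centers <- matrix(NA_real_, nrow = num_clusters, ncol = ncol(data))
  # First center: one data point chosen uniformly at random.
  centers[1, ] <- data[sample.int(n, 1), ]
  # Squared distance from every point to its nearest chosen center.
  min_sq_dist <- rowSums(sweep(data, 2, centers[1, ])^2)
  for (j in seq_len(num_clusters)[-1]) {
    # Next center: sampled with probability proportional to that distance.
    idx <- sample.int(n, 1, prob = min_sq_dist)
    centers[j, ] <- data[idx, ]
    sq_dist <- rowSums(sweep(data, 2, centers[j, ])^2)
    min_sq_dist <- pmin(min_sq_dist, sq_dist)
  }
  centers
}

set.seed(42)
seeds <- kmeanspp_seed(iris[, 1:4], num_clusters = 3)
```

Because an already chosen point has zero distance to itself, it cannot be drawn again, so the seeds are distinct data points that tend to be spread across the data.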
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization. SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
External Packages: DCEM requires the R packages 'mvtnorm' [1], 'matrixcalc' [2], 'Rcpp' [3] and 'MASS' [4] for multivariate density calculation, checking matrix singularity, compiling routines written in C++, and simulating mixtures of Gaussians, respectively.
[1] Alan Genz, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, Torsten Hothorn (2019). mvtnorm: Multivariate Normal and t Distributions. R package version 1.0-7. URL http://CRAN.R-project.org/package=mvtnorm
[2] Frederick Novomestky (2012). matrixcalc: Collection of functions for matrix calculations. R package version 1.0-3. https://CRAN.R-project.org/package=matrixcalc
[3] Dirk Eddelbuettel and Romain Francois (2011). Rcpp: Seamless R and C++ Integration. Journal of Statistical Software, 40(8), 1-18. URL http://www.jstatsoft.org/v40/i08/.
[4] Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
[5] David Arthur and Sergei Vassilvitskii. K-means++: The Advantages of Careful Seeding. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf