dcem_star_train {DCEM} | R Documentation
dcem_star_train: Part of the DCEM package.
Description
Implements the improved EM* ([1], [2]) algorithm. EM* avoids revisiting all but the highly expressive data via structure-based data segregation, which results in a significant speed gain.
It calls the dcem_star_cluster_uv routine internally for univariate data and the dcem_star_cluster_mv routine for multivariate data.
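For orientation, the sketch below shows plain EM for a univariate Gaussian mixture in base R. It is not EM* and not the DCEM implementation: plain EM revisits every data point in every iteration, which is the cost EM* is described as reducing. The function name em_univariate_sketch and the tol argument are hypothetical; the returned field names (prob, meu, sigma, prior) only mirror the ones documented on this page.

```r
# Plain EM for a univariate Gaussian mixture. This is NOT EM* and NOT DCEM code;
# the function name and the `tol` argument are hypothetical.
em_univariate_sketch <- function(x, num_clusters = 2, iteration_count = 200, tol = 1e-6) {
  n <- length(x)
  meu <- sample(x, num_clusters)            # random seeding from the data
  sigma <- rep(sd(x), num_clusters)         # start every cluster with the overall spread
  prior <- rep(1 / num_clusters, num_clusters)
  prob <- matrix(0, nrow = num_clusters, ncol = n)

  for (iter in seq_len(iteration_count)) {
    # E-step: unnormalised posterior of every cluster for every point.
    for (k in seq_len(num_clusters)) {
      prob[k, ] <- prior[k] * dnorm(x, mean = meu[k], sd = sigma[k])
    }
    # Normalise each column so the posteriors of a point sum to 1.
    col_total <- pmax(colSums(prob), .Machine$double.xmin)
    prob <- sweep(prob, 2, col_total, "/")

    # M-step: re-estimate the parameters from all points.
    old_meu <- meu
    weight_sum <- pmax(rowSums(prob), .Machine$double.eps)
    meu <- as.vector(prob %*% x) / weight_sum
    for (k in seq_len(num_clusters)) {
      sigma[k] <- sqrt(sum(prob[k, ] * (x - meu[k])^2) / weight_sum[k])
    }
    sigma <- pmax(sigma, 1e-6)              # guard against a collapsed cluster
    prior <- weight_sum / n

    # Stop early once the means no longer move.
    if (max(abs(meu - old_meu)) < tol) break
  }
  list(prob = prob, meu = meu, sigma = sigma, prior = prior)
}

set.seed(1)
x <- c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2))
fit <- em_univariate_sketch(x, num_clusters = 3, iteration_count = 100)
```

In this sketch prob has one row per cluster; that layout is a choice made here and may differ from the layout DCEM returns.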
Usage
dcem_star_train(data, iteration_count, num_clusters, seed_meu, seeding)
Arguments
data
(dataframe): The dataframe containing the data. See
iteration_count
(numeric): The maximum number of iterations to run. If convergence is not achieved within this many iterations, the algorithm stops and exits. Default: 200.
num_clusters
(numeric): The number of clusters. Default: 2
seed_meu
(matrix): The user-specified set of meu to use as initial centroids. Default: None
seeding
(string): The initialization scheme ('rand' or 'improved'). Default: 'rand'
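As a hedged illustration of the arguments above, the call below passes every argument by name and selects the 'improved' initialization scheme instead of the default. The variable names toy_data and toy_out are made up for this sketch, and the call is guarded so that the snippet still runs where the DCEM package is not installed.

```r
# Toy univariate data: three Gaussian components (220 points in total).
set.seed(42)
toy_data <- as.data.frame(c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2)))

# Only call into DCEM when the package is available.
if (requireNamespace("DCEM", quietly = TRUE)) {
  toy_out <- DCEM::dcem_star_train(
    data = toy_data,
    iteration_count = 100,   # upper bound on iterations
    num_clusters = 3,
    seeding = "improved"     # alternative to the default 'rand'
  )
  print(toy_out$meu)
}
```

Naming the arguments avoids depending on their position in the signature; seed_meu is left out here, so the initial centroids come from the chosen seeding scheme.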
Value
A list of objects. This list contains the parameters associated with the Gaussian(s) (posterior probabilities, meu, sigma and priors). The parameters can be accessed as follows, where sample_out is the list containing the output:
(1) Posterior probabilities: sample_out$prob
A matrix of posterior probabilities.
(2) Meu(s): sample_out$meu
For multivariate data: A matrix of meu(s). Each row in the matrix corresponds to one meu.
For univariate data: A vector of meu(s). Each element of the vector corresponds to one meu.
(3) Covariance matrices or standard deviations: sample_out$sigma
For multivariate data: A list of covariance matrices.
For univariate data: A vector of standard deviations.
(4) Priors: sample_out$prior
A vector of priors.
(5) Membership: sample_out$membership
A dataframe of cluster membership for the data. Column numbers are data indices and values are the assigned clusters.
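The sketch below uses a hand-built list that mimics the documented fields for a univariate result, to show how they might be inspected. Every number in mock_out is invented for illustration and is not DCEM output; the one-row-per-cluster layout of prob is an assumption, since the layout is not stated here.

```r
# Hypothetical stand-in for a univariate result with 2 clusters and 4 points.
# All values are invented for illustration; they are not DCEM output.
mock_out <- list(
  prob = matrix(c(0.9, 0.8, 0.1, 0.3,
                  0.1, 0.2, 0.9, 0.7), nrow = 2, byrow = TRUE),  # assumed layout
  meu = c(20, 70),
  sigma = c(5, 1),
  prior = c(0.5, 0.5),
  # Column names are data indices, values are the assigned clusters.
  membership = data.frame(`1` = 1, `2` = 1, `3` = 2, `4` = 2, check.names = FALSE)
)

# Number of points assigned to each cluster.
cluster_sizes <- table(unlist(mock_out$membership))

# The priors of a mixture should add up to 1.
prior_total <- sum(mock_out$prior)

print(cluster_sizes)
print(prior_total)
```

The same two checks can be applied to a real result by replacing mock_out with the list returned by dcem_star_train().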
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization. SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
Examples
# Simulating a mixture of univariate samples from three distributions
# with means of 20, 70 and 100 and standard deviations of 5, 1 and 2 respectively.
sample_uv_data = as.data.frame(c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2)))
# Randomly shuffle the samples.
sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])
# Calling the dcem_star_train() function on the simulated data with an iteration count of 100
# and the default random seeding.
sample_uv_out = dcem_star_train(sample_uv_data, num_clusters = 3, iteration_count = 100)
# Simulating a mixture of multivariate samples from two Gaussian distributions.
sample_mv_data = as.data.frame(rbind(MASS::mvrnorm(n=2, rep(2,5), Sigma = diag(5)),
                                     MASS::mvrnorm(n=5, rep(14,5), Sigma = diag(5))))
# Calling the dcem_star_train() function on the simulated data with an iteration count of 100
# and the default random seeding.
sample_mv_out = dcem_star_train(sample_mv_data, iteration_count = 100, num_clusters=2)
# Access the output
sample_mv_out$meu
sample_mv_out$sigma
sample_mv_out$prior
sample_mv_out$prob
print(sample_mv_out$membership)