dcem_train {DCEM}    R Documentation

dcem_train: Part of the DCEM package.

Description

Implements the Expectation Maximization (EM) algorithm. Internally, it calls the relevant clustering routine: dcem_cluster_uv for univariate data or dcem_cluster_mv for multivariate data.

Usage

dcem_train(data, threshold, iteration_count,  num_clusters, seed_meu, seeding)

Arguments

data

(dataframe): The dataframe containing the data. See trim_data for cleaning the data.

threshold

(decimal): The convergence threshold. The algorithm stops and exits once the meu change by less than this value between successive iterations. Default: 0.00001.

iteration_count

(numeric): The maximum number of iterations to run. If convergence is not achieved within this count, the algorithm stops and exits. Default: 200.

num_clusters

(numeric): The number of clusters. Default: 2

seed_meu

(matrix): The user specified set of meu to use as initial centroids. Default: None

seeding

(string): The initialization scheme, either 'rand' or 'improved'. Default: 'rand'.
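
To make the roles of threshold, iteration_count, num_clusters and seed_meu concrete, here is a minimal base-R sketch of EM for a univariate Gaussian mixture. It is not the DCEM source: the function name em_sketch, its internals, and the choice of the Euclidean shift of the meu as the convergence measure are all assumptions made for illustration.

```r
# Illustrative sketch only (hypothetical helper, not part of DCEM).
em_sketch <- function(x, num_clusters = 2, threshold = 0.00001,
                      iteration_count = 200, seed_meu = NULL) {
  n <- length(x)
  # Use the supplied meu if given, otherwise pick random data points.
  meu <- if (is.null(seed_meu)) sample(x, num_clusters) else seed_meu
  sigma <- rep(sd(x), num_clusters)
  prior <- rep(1 / num_clusters, num_clusters)
  prob <- matrix(0, num_clusters, n)
  iterations <- 0

  for (i in seq_len(iteration_count)) {
    iterations <- i
    # E-step: weighted density of every point under every cluster,
    # normalised per point (rows are clusters, columns are points).
    for (k in seq_len(num_clusters)) {
      prob[k, ] <- prior[k] * dnorm(x, meu[k], sigma[k])
    }
    prob <- sweep(prob, 2, colSums(prob), "/")

    # M-step: re-estimate meu, sigma and priors from the posteriors.
    weight <- rowSums(prob)
    new_meu <- as.vector(prob %*% x) / weight
    sigma <- sqrt(rowSums(prob * outer(new_meu, x, "-")^2) / weight)
    sigma <- pmax(sigma, 1e-6)  # guard against a collapsed cluster
    prior <- weight / n

    # Assumed stopping rule: stop once the meu move less than threshold.
    shift <- sqrt(sum((new_meu - meu)^2))
    meu <- new_meu
    if (shift < threshold) break
  }
  list(meu = meu, sigma = sigma, prior = prior, prob = prob,
       iterations = iterations)
}

set.seed(1)
x <- c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2))
fit <- em_sketch(x, num_clusters = 3, threshold = 0.001,
                 iteration_count = 100, seed_meu = c(10, 60, 110))
print(fit$meu)
print(fit$iterations)
```

A smaller threshold or a larger iteration_count gives the loop more opportunity to refine the estimates, at the cost of run time; whichever limit is reached first ends the run.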

Value

A list of objects. The list contains the parameters associated with the Gaussian(s): posterior probabilities, meu, sigma and priors. The parameters can be accessed as follows, where sample_out is the list containing the output:

  1. Posterior Probabilities: sample_out$prob: A matrix of posterior probabilities.

  2. Meu: sample_out$meu

    For multivariate data: A matrix of meu(s). Each row in the matrix corresponds to one meu.

    For univariate data: A vector of meu(s). Each element of the vector corresponds to one meu.

  3. Sigma: sample_out$sigma

    For multivariate data: A list of covariance matrices for the Gaussian(s).

    For univariate data: A vector of standard deviations for the Gaussian(s).

  4. Priors: sample_out$prior: A vector of priors.

  5. Membership: sample_out$membership: A dataframe of cluster membership for the data. Column numbers are data indices and values are the assigned clusters.
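
The sketch below shows how the listed components can be inspected and how hard cluster labels can be derived from a posterior-probability matrix. The sample_out list here is a hand-built, hypothetical stand-in rather than real dcem_train() output, and it assumes that rows of prob are clusters and columns are data points; check the dimensions of the actual output before relying on that orientation.

```r
# Hypothetical stand-in for the returned list: 2 clusters, 4 data points.
# Assumption: rows of prob are clusters, columns are data points.
sample_out <- list(
  prob  = matrix(c(0.90, 0.10,
                   0.20, 0.80,
                   0.60, 0.40,
                   0.05, 0.95), nrow = 2),
  meu   = c(20, 70),
  sigma = c(5, 1),
  prior = c(0.5, 0.5)
)

# Each column of prob should sum to 1 (one distribution per data point).
print(colSums(sample_out$prob))

# One common way to obtain hard assignments: the most probable cluster.
hard_labels <- apply(sample_out$prob, 2, which.max)
print(hard_labels)

# Number of points assigned to each cluster.
cluster_sizes <- tabulate(hard_labels, nbins = nrow(sample_out$prob))
print(cluster_sizes)
```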

References

Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization. SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944

Examples

# Simulating a mixture of univariate samples from three distributions
# with meu of 20, 70 and 100 and standard deviations of 5, 1 and 2 respectively.
sample_uv_data = as.data.frame(c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2)))

# Randomly shuffle the samples.
sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])

# Calling the dcem_train() function on the simulated data with three clusters,
# a threshold of 0.001, an iteration count of 100 and the default (random) seeding.
sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3, iteration_count = 100,
                           threshold = 0.001)

# Simulating a mixture of multivariate samples from two Gaussian distributions.
sample_mv_data = as.data.frame(rbind(MASS::mvrnorm(n=100, rep(2,5), Sigma = diag(5)),
                                     MASS::mvrnorm(n=50, rep(14,5), Sigma = diag(5))))

# Calling the dcem_train() function on the simulated data with a threshold of
# 0.001, an iteration count of 100 and the defaults for the remaining
# arguments (two clusters, random seeding).
sample_mv_out = dcem_train(sample_mv_data, threshold = 0.001, iteration_count = 100)

# Access the output
print(sample_mv_out$meu)
print(sample_mv_out$sigma)
print(sample_mv_out$prior)
print(sample_mv_out$prob)
print(sample_mv_out$membership)
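
The examples above rely on the default random seeding. As a further sketch, the call below passes seeding = "improved", one of the two schemes listed under Arguments. The variable names are illustrative, and the call is guarded so that the snippet still runs when the DCEM package is not installed.

```r
# Simulate univariate data as in the first example (names are illustrative).
set.seed(7)
sample_data <- as.data.frame(c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2)))

# Only call into DCEM when the package is available.
if (requireNamespace("DCEM", quietly = TRUE)) {
  out_improved <- DCEM::dcem_train(sample_data, threshold = 0.001,
                                   iteration_count = 100, num_clusters = 3,
                                   seeding = "improved")
  print(out_improved$meu)
  print(out_improved$prior)
}
```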


[Package DCEM version 2.0.5 Index]