mou_EM {deepMOU}    R Documentation

Mixture of Unigrams by Expectation-Maximization algorithm

Description

Performs parameter estimation by means of the Expectation-Maximization (EM) algorithm and cluster allocation for the Mixture of Unigrams.

Usage

mou_EM(x, k, n_it = 100, eps = 1e-07, seed_choice = 1, init = NULL)

Arguments

x

Document-term matrix describing the frequency of terms that occur in a collection of documents. Rows correspond to documents in the collection and columns correspond to terms.

k

Number of clusters/groups.

n_it

Number of iterations for the Expectation-Maximization algorithm.

eps

Tolerance level for the convergence of the algorithm. Default is 1e-07.

seed_choice

Seed for reproducible results. Default is 1.

init

Vector containing the initial document allocations for the initialization of the algorithm. If NULL (default), initialization is carried out via spherical k-means (skmeans).

Details

Starting from the data given by x, the Mixture of Unigrams model is fitted and k clusters are obtained. Parameters are estimated by the Expectation-Maximization (EM) algorithm: the function assigns initial values to the Multinomial probabilities of each cluster and initial weights to the components of the mixture, then alternates E- and M-steps for at most n_it iterations or until the tolerance level eps is reached. Using the posterior distribution of the latent variable z, each document is allocated to the cluster that maximizes this posterior. For further details see the references.
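The alternation of E- and M-steps described above can be sketched in base R. This is an illustration only, not the package's implementation: the function name em_unigrams, the random initialization (deepMOU defaults to spherical k-means), and the smoothing constant are assumptions made here for a self-contained example.

```r
# Hedged sketch of EM for a mixture of unigrams.
# x: document-term count matrix (documents in rows, terms in columns).
em_unigrams <- function(x, k, n_it = 100, eps = 1e-7, seed = 1) {
  set.seed(seed)
  n <- nrow(x)
  # Random soft allocations to start (the package instead uses skmeans).
  z <- matrix(runif(n * k), n, k)
  z <- z / rowSums(z)
  loglik <- -Inf
  for (it in seq_len(n_it)) {
    # M-step: mixture weights and Multinomial parameters per cluster.
    pi_hat <- colMeans(z)
    omega  <- t(z) %*% x + 1e-10        # k x p, smoothed to avoid log(0)
    omega  <- omega / rowSums(omega)
    # E-step: posterior f(z | x) computed on the log scale.
    logf <- x %*% t(log(omega)) + rep(log(pi_hat), each = n)
    m    <- apply(logf, 1, max)
    lik  <- m + log(rowSums(exp(logf - m)))  # log-sum-exp per document
    z    <- exp(logf - lik)                  # rows sum to 1
    new_loglik <- sum(lik)
    if (abs(new_loglik - loglik) < eps) break
    loglik <- new_loglik
  }
  list(clusters = max.col(z), pi_hat = pi_hat, omega = omega,
       f_z_x = z, loglik = loglik)
}
```

Each document is then assigned to the cluster maximizing its row of f_z_x, mirroring the allocation rule stated above.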

Value

A list containing the following elements:

x

The data matrix.

clusters

The clustering labels.

k

The number of clusters.

numobs

The sample size.

p

The vocabulary size.

likelihood

Vector containing the likelihood values at each iteration.

pi_hat

Estimated probabilities of belonging to each of the k clusters.

omega

Matrix containing the estimates of the omega parameters for each cluster.

f_z_x

Matrix containing the posterior probabilities of belonging to each cluster.

AIC

Akaike Information Criterion (AIC) value of the fitted model.

BIC

Bayesian Information Criterion (BIC) value of the fitted model.
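The AIC and BIC returned above follow the usual penalized-likelihood form. As a hedged sketch (the exact free-parameter count used by deepMOU is an assumption here: k - 1 mixture weights plus k * (p - 1) Multinomial probabilities):

```r
# Illustrative computation of AIC/BIC from a fitted mixture of unigrams;
# the parameter count npar is an assumption, not taken from deepMOU.
info_criteria <- function(loglik, k, p, n) {
  npar <- (k - 1) + k * (p - 1)  # mixture weights + Multinomial parameters
  c(AIC = -2 * loglik + 2 * npar,
    BIC = -2 * loglik + npar * log(n))
}
```

Lower values indicate a better trade-off between fit and model complexity, which is useful when comparing fits across different choices of k.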

References

Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine learning 39, 103-134 (2000).

Examples

# Load the CNAE2 dataset
data("CNAE2")

# Perform parameter estimation and clustering
mou_CNAE2 <- mou_EM(x = CNAE2, k = 2)

# Show the cluster labels assigned to the documents
mou_CNAE2$clusters


[Package deepMOU version 0.1.1 Index]