R: Dirichlet-Multinomial mixture model by Gradient Descent...

dir_mult_GD {deepMOU}

R Documentation

Dirichlet-Multinomial mixture model by Gradient Descent algorithm

Description

Estimates the parameters of the Dirichlet-Multinomial mixture model with a Gradient Descent algorithm and allocates the documents to clusters.

Usage

dir_mult_GD(
  x,
  k,
  n_it = 100,
  eps = 1e-05,
  seed_choice = 1,
  KK = 20,
  min_iter = 2,
  init = NULL
)

Arguments

`x`	Document-term matrix describing the frequency of terms that occur in a collection of documents. Rows correspond to documents in the collection and columns correspond to terms.
`k`	Number of clusters/groups.
`n_it`	Maximum number of Gradient Descent steps. Default is `100`.
`eps`	Tolerance level for the convergence of the algorithm. Default is `1e-05`.
`seed_choice`	Set seed for reproducible results.
`KK`	Maximum number of iterations allowed for the `nlminb` function. Default is `20`.
`min_iter`	Minimum number of Gradient Descent steps.
`init`	Vector containing the initial document allocations used to initialize the algorithm. If `NULL` (default), initialization is carried out via spherical k-means (skmeans).
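
As an illustration of the `init` argument, the sketch below builds an initial allocation vector in base R. The toy matrix `toy_dtm` and the use of `stats::kmeans` on unit-length rows (only a rough stand-in for spherical k-means) are assumptions made for this example, as is the expectation that `init` holds one integer label in 1..k per document; check the package source before relying on it.

```r
# Hypothetical toy document-term matrix: 6 documents, 4 terms.
toy_dtm <- matrix(
  c(5, 4, 0, 0,
    6, 3, 1, 0,
    4, 5, 0, 1,
    0, 1, 5, 6,
    1, 0, 4, 5,
    0, 0, 6, 4),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), paste0("term", 1:4))
)

# Scale each row to unit Euclidean length so that k-means on the scaled rows
# groups documents by direction (a rough approximation of spherical k-means).
row_norms <- sqrt(rowSums(toy_dtm^2))
unit_rows <- toy_dtm / row_norms

set.seed(1)
k <- 2
init_labels <- kmeans(unit_rows, centers = k, nstart = 10)$cluster

# One label per document
init_labels

# Assumed usage (requires the deepMOU package):
# fit <- dir_mult_GD(x = toy_dtm, k = k, init = init_labels)
```

Any other source of starting labels, such as a previous clustering run, could presumably be passed the same way, provided it has one entry per row of `x`.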

Details

Starting from the data given by x, the Dirichlet-Multinomial mixture model is fitted and k clusters are obtained. Parameters are estimated by Gradient Descent. The function first assigns initial values to the Dirichlet-Multinomial parameters of each cluster and to the weights of the mixture components. The estimates are then updated for at most n_it Gradient Descent steps, or until the tolerance level eps is reached. Finally, each document is allocated to the cluster that maximizes the posterior distribution of the latent variable z. For further details see the references.
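
The allocation step described above can be sketched in base R. This is a minimal illustration, not the package's implementation: it uses the standard Dirichlet-Multinomial probability mass function with a positive concentration vector `alpha`, which may differ from the parameterization behind `Theta`, and the helper names `dm_logpmf` and `allocate` are hypothetical.

```r
# Log-pmf of the standard Dirichlet-Multinomial distribution for one count
# vector x and a positive concentration vector alpha (assumed parameterization).
dm_logpmf <- function(x, alpha) {
  n <- sum(x)
  a <- sum(alpha)
  lgamma(n + 1) + lgamma(a) - lgamma(n + a) +
    sum(lgamma(x + alpha) - lgamma(x + 1) - lgamma(alpha))
}

# Posterior cluster probabilities and hard allocation for every document.
# X: documents x terms count matrix; alphas: k x terms matrix;
# pi_k: mixing weights of length k.
allocate <- function(X, alphas, pi_k) {
  k <- nrow(alphas)
  # log(pi_j) + log f(x_i | alpha_j) for every document i and cluster j
  log_joint <- sapply(seq_len(k), function(j) {
    log(pi_k[j]) + apply(X, 1, dm_logpmf, alpha = alphas[j, ])
  })
  # Normalize each row after subtracting its maximum, for numerical stability
  row_max <- apply(log_joint, 1, max)
  post <- exp(log_joint - row_max)
  post <- post / rowSums(post)
  list(posterior = post, clusters = max.col(post, ties.method = "first"))
}

# Toy data: three documents over a three-term vocabulary, two clusters
X <- rbind(c(8, 1, 1), c(7, 2, 1), c(1, 1, 8))
alphas <- rbind(c(6, 1, 1), c(1, 1, 6))
res <- allocate(X, alphas, pi_k = c(0.5, 0.5))
res$clusters
```

Working on the log scale avoids underflow when documents contain many terms, which is why the posterior is normalized only after the row maximum has been subtracted.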

Value

A list containing the following elements:

`x`	The data matrix.
`clusters`	The clustering labels.
`k`	The number of clusters.
`numobs`	The sample size.
`p`	The vocabulary size.
`likelihood`	Vector containing the likelihood values at each iteration.
`pi_hat`	Estimated probabilities of belonging to the `k` clusters.
`Theta`	Matrix containing the estimates of the Theta parameters for each cluster.
`f_z_x`	Matrix containing the posterior probabilities of belonging to each cluster.
`AIC`	Akaike Information Criterion (AIC) value of the fitted model.
`BIC`	Bayesian Information Criterion (BIC) value of the fitted model.

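For orientation on the `AIC` and `BIC` elements, the sketch below computes both criteria from their conventional definitions. The helper `info_criteria`, the numbers used, and the parameter count `k * p + (k - 1)` are assumptions made for illustration; the package may count parameters or orient the sign differently, so compare against its actual output.

```r
# Conventional information criteria from a maximized log-likelihood.
# loglik: maximized log-likelihood; m: number of free parameters;
# n: number of documents (the sample size).
info_criteria <- function(loglik, m, n) {
  c(AIC = -2 * loglik + 2 * m,
    BIC = -2 * loglik + m * log(n))
}

# Hypothetical numbers, not output from dir_mult_GD.
k <- 2
p <- 50
n <- 200
# Assumed count: k parameter vectors of length p plus k - 1 free mixing weights
m <- k * p + (k - 1)

ic <- info_criteria(loglik = -1234.5, m = m, n = n)
ic
```

Under these definitions, smaller values indicate a better trade-off between fit and complexity, so fits obtained with different values of `k` can be compared by their `BIC`.
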
References

Anderlucci L, Viroli C (2020). "Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data." Advances in Data Analysis and Classification, 14, 759-770. doi: 10.1007/s11634-020-00399-3.

Examples

# Load the CNAE2 dataset
data("CNAE2")

# Perform parameter estimation and clustering; very
# few iterations are used in this example
dir_CNAE2 = dir_mult_GD(x = CNAE2, k = 2, n_it = 2)

# Show the cluster labels assigned to the documents
dir_CNAE2$clusters

[Package deepMOU version 0.1.1 Index]