dir_mult_GD {deepMOU}    R Documentation

Dirichlet-Multinomial mixture model by Gradient Descent algorithm

Description

Estimates the parameters of the Dirichlet-Multinomial mixture model by means of a Gradient Descent algorithm and allocates the documents to clusters.

Usage

dir_mult_GD(
  x,
  k,
  n_it = 100,
  eps = 1e-05,
  seed_choice = 1,
  KK = 20,
  min_iter = 2,
  init = NULL
)

Arguments

x

Document-term matrix describing the frequency of terms that occur in a collection of documents. Rows correspond to documents in the collection and columns correspond to terms.

k

Number of clusters/groups.

n_it

Maximum number of Gradient Descent steps. Default is 100.

eps

Tolerance level for the convergence of the algorithm. Default is 1e-05.

seed_choice

Seed to set for reproducible results. Default is 1.

KK

Maximum number of iterations allowed for the nlminb function (see below).

min_iter

Minimum number of Gradient Descent steps. Default is 2.

init

Vector containing the initial document allocations used to initialize the algorithm. If NULL (default), initialization is carried out via spherical k-means (skmeans).

Details

Starting from the data given by x, the Dirichlet-Multinomial mixture model is fitted and k clusters are obtained. The parameters are estimated by Gradient Descent. In particular, the function assigns initial values to the weights of the Dirichlet-Multinomial distribution for each cluster and initial weights to the components of the mixture. The estimates are obtained after at most n_it steps of the Gradient Descent algorithm, or earlier if the tolerance level eps is reached. Using the posterior distribution of the latent variable z, each document is then allocated to the cluster that maximizes its posterior probability. For further details see the references.
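The interplay of n_it, eps and min_iter can be sketched as a generic stopping rule. This is only an illustration under stated assumptions: the helper run_until_converged and the toy step function are hypothetical, and the use of a relative change in the likelihood as the convergence measure is an assumption, since the criterion used internally by dir_mult_GD is not documented here.

```r
# Hypothetical driver (not part of deepMOU): `step` performs one update
# and returns the log-likelihood reached after that update.
run_until_converged <- function(step, n_it = 100, eps = 1e-05, min_iter = 2) {
  lik <- numeric(0)
  for (it in seq_len(n_it)) {
    lik[it] <- step(it)
    # Only test for convergence once min_iter steps have been performed
    if (it >= min_iter && it > 1) {
      # Assumed criterion: relative change of the log-likelihood
      rel_change <- abs(lik[it] - lik[it - 1]) / abs(lik[it - 1])
      if (rel_change < eps) break
    }
  }
  lik
}

# Toy step: a log-likelihood that increases geometrically towards -100
toy_step <- function(it) -100 - 50 * 0.1^it

lik <- run_until_converged(toy_step)
length(lik)  # number of steps actually performed
```

With this rule the loop never runs more than n_it steps and never stops before min_iter steps, which matches the roles the arguments are given above.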

Value

A list containing the following elements:

x

the data matrix.

clusters

the clustering labels.

k

the number of clusters.

numobs

the sample size.

p

the vocabulary size.

likelihood

vector containing the likelihood values at each iteration.

pi_hat

estimated mixture weights, i.e. the estimated prior probabilities of belonging to each of the k clusters.

Theta

matrix containing the estimates of the Theta parameters for each cluster.

f_z_x

matrix containing the posterior probabilities of belonging to each cluster.

AIC

Akaike Information Criterion (AIC) value of the fitted model.

BIC

Bayesian Information Criterion (BIC) value of the fitted model.
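How pi_hat, Theta, f_z_x and clusters relate to each other can be illustrated with a minimal base-R sketch. It is not the package's implementation: the helpers log_dm and posterior_z are hypothetical, the data are toy values, and storing Theta as a k x p matrix (one row of Dirichlet parameters per cluster) is an assumption made for this example.

```r
# Log-pmf of the Dirichlet-Multinomial for one count vector x
# and a vector of positive parameters alpha
log_dm <- function(x, alpha) {
  n <- sum(x)
  A <- sum(alpha)
  lgamma(n + 1) - sum(lgamma(x + 1)) +
    lgamma(A) - lgamma(n + A) +
    sum(lgamma(x + alpha) - lgamma(alpha))
}

# Posterior probabilities of cluster membership
# (rows: documents, columns: clusters)
posterior_z <- function(x, pi_hat, Theta) {
  k <- length(pi_hat)
  log_joint <- sapply(seq_len(k), function(g) {
    log(pi_hat[g]) + apply(x, 1, log_dm, alpha = Theta[g, ])
  })
  log_joint <- matrix(log_joint, nrow = nrow(x))
  # Subtract the row maximum before exponentiating for numerical stability
  m <- apply(log_joint, 1, max)
  w <- exp(log_joint - m)
  w / rowSums(w)
}

# Toy data: 4 documents, 3 terms (hypothetical values)
x <- matrix(c(5, 0, 1,
              4, 1, 0,
              0, 6, 2,
              1, 5, 3), ncol = 3, byrow = TRUE)
pi_hat <- c(0.5, 0.5)
Theta  <- rbind(c(5, 1, 1),   # assumed layout: one row per cluster
                c(1, 5, 2))

f_z_x    <- posterior_z(x, pi_hat, Theta)
clusters <- apply(f_z_x, 1, which.max)  # allocate to the most probable cluster
```

Each row of f_z_x sums to one, and the label in clusters is the column index of the largest posterior probability in that row.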

References

Anderlucci L, Viroli C (2020). "Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data." Advances in Data Analysis and Classification, 14, 759-770. doi: 10.1007/s11634-020-00399-3.

Examples

# Load the CNAE2 dataset
data("CNAE2")

# Perform parameter estimation and clustering; very
# few iterations are used for this example
dir_CNAE2 = dir_mult_GD(x = CNAE2, k = 2, n_it = 2)

# Show the cluster labels assigned to the documents
dir_CNAE2$clusters


[Package deepMOU version 0.1.1 Index]