dir_mult_GD {deepMOU} | R Documentation |
Dirichlet-Multinomial mixture model by Gradient Descent algorithm
Description
Estimates the parameters of the Dirichlet-Multinomial mixture model by means of a Gradient Descent algorithm and allocates the documents to clusters.
Usage
dir_mult_GD(
x,
k,
n_it = 100,
eps = 1e-05,
seed_choice = 1,
KK = 20,
min_iter = 2,
init = NULL
)
Arguments
x |
Document-term matrix describing the frequency of terms that occur in a collection of documents. Rows correspond to documents in the collection and columns correspond to terms. |
k |
Number of clusters/groups. |
n_it |
Maximum number of Gradient Descent steps. |
eps |
Tolerance level for the convergence of the algorithm. Default is 1e-05. |
seed_choice |
Set seed for reproducible results. |
KK |
Maximum number of iterations allowed for the nlminb function (see below). |
min_iter |
Minimum number of Gradient Descent steps. |
init |
Vector containing the initial document allocations for the initialization of the algorithm. If NULL (default) initialization is carried out via spherical k-means (skmeans). |
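When a custom initialization is preferred over the default, the init vector can be prepared beforehand. The sketch below is only an illustration under stated assumptions: it takes init to be an integer vector of labels in 1..k with one entry per document (the encoding is not stated here), and it uses base R kmeans on unit-length rows as a rough stand-in for spherical k-means, which is not the package's own skmeans-based routine. The matrix x_toy is hypothetical data.

```r
# Hypothetical toy document-term matrix: 6 documents (rows), 4 terms (columns)
x_toy <- rbind(
  c(5, 1, 0, 0),
  c(4, 2, 0, 1),
  c(6, 0, 1, 0),
  c(0, 1, 5, 4),
  c(1, 0, 4, 6),
  c(0, 0, 3, 5)
)
k <- 2

# Scale each document to unit Euclidean length so that distances between
# rows reflect the angle between documents (assumes no all-zero rows)
x_unit <- x_toy / sqrt(rowSums(x_toy^2))

# Base R k-means on the normalized rows, used here as a rough stand-in
# for spherical k-means
set.seed(1)
init <- kmeans(x_unit, centers = k, nstart = 5)$cluster

# Assumed usage: pass the labels through the init argument
# fit <- dir_mult_GD(x = x_toy, k = k, init = init)
```

Any other labelling with one entry per row of x, for example labels from a previous fit, could presumably be supplied in the same way.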
Details
Starting from the data given by x, the Dirichlet-Multinomial mixture model is fitted and k clusters are obtained. The parameters are estimated by Gradient Descent. In particular, the function assigns initial values to the weights of the Dirichlet-Multinomial distribution for each cluster and initial weights to the components of the mixture. The estimates are obtained with at most n_it steps of the Gradient Descent algorithm, or earlier if the tolerance level eps is reached. Using the posterior distribution of the latent variable z, each document is then allocated to the cluster that maximizes its posterior probability.
For further details see the references.
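The allocation step described above can be illustrated with a minimal sketch. It is not the package's implementation: the helper names ddirmult_log and posterior_alloc are hypothetical, the density uses the standard Dirichlet-Multinomial form with a generic concentration vector alpha (the package's parameterization of Theta may differ), and Theta is assumed to be a matrix with one row per cluster and one column per term.

```r
# Log-density of the Dirichlet-Multinomial for one count vector x,
# under a generic concentration vector alpha (assumed parameterization)
ddirmult_log <- function(x, alpha) {
  n <- sum(x)
  A <- sum(alpha)
  lgamma(n + 1) - sum(lgamma(x + 1)) +
    lgamma(A) - lgamma(n + A) +
    sum(lgamma(x + alpha) - lgamma(alpha))
}

# Posterior probabilities of cluster membership and maximum a posteriori
# allocation. X: documents x terms; pi_hat: mixture weights;
# Theta: clusters x terms (assumed layout)
posterior_alloc <- function(X, pi_hat, Theta) {
  k <- length(pi_hat)
  log_num <- vapply(
    seq_len(k),
    function(g) log(pi_hat[g]) + apply(X, 1, ddirmult_log, alpha = Theta[g, ]),
    numeric(nrow(X))
  )
  log_num <- matrix(log_num, nrow = nrow(X))
  # Subtract the row-wise maximum before exponentiating to avoid underflow
  post <- exp(log_num - apply(log_num, 1, max))
  post <- post / rowSums(post)
  list(f_z_x = post, clusters = max.col(post, ties.method = "first"))
}

# Hypothetical parameters for two clusters over a two-term vocabulary
X <- rbind(c(3, 0), c(0, 3))
pi_hat <- c(0.5, 0.5)
Theta <- rbind(c(5, 1), c(1, 5))
res <- posterior_alloc(X, pi_hat, Theta)
res$clusters
```

Working on the log scale keeps the computation stable for long documents, where the raw densities are too small to represent directly.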
Value
A list containing the following elements:
x |
the data matrix. |
clusters |
the clustering labels. |
k |
the number of clusters. |
numobs |
the sample size. |
p |
the vocabulary size. |
likelihood |
vector containing the likelihood values at each iteration. |
pi_hat |
estimated probabilities of belonging to the k clusters. |
Theta |
matrix containing the estimates of the Theta parameters for each cluster. |
f_z_x |
matrix containing the posterior probabilities of belonging to each cluster. |
AIC |
Akaike Information Criterion (AIC) value of the fitted model. |
BIC |
Bayesian Information Criterion (BIC) value of the fitted model. |
References
Anderlucci L, Viroli C (2020). "Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data." Advances in Data Analysis and Classification, 14, 759-770. doi: 10.1007/s11634-020-00399-3.
Examples
# Load the CNAE2 dataset
data("CNAE2")
# Perform parameter estimation and clustering; very
# few iterations are used for this example
dir_CNAE2 <- dir_mult_GD(x = CNAE2, k = 2, n_it = 2)
# Show the cluster labels assigned to the documents
dir_CNAE2$clusters