lda.collapsed.gibbs.sampler {lda} | R Documentation |
Functions to Fit LDA-type models
Description
These functions use a collapsed Gibbs sampler to fit three different models: latent Dirichlet allocation (LDA), the mixed-membership stochastic blockmodel (MMSB), and supervised LDA (sLDA). These functions take sparsely represented input documents, perform inference, and return point estimates of the latent parameters using the state at the last iteration of Gibbs sampling. Multinomial logit for sLDA is supported using the multinom function from the nnet package.
Usage
lda.collapsed.gibbs.sampler(documents, K, vocab, num.iterations, alpha,
eta, initial = NULL, burnin = NULL, compute.log.likelihood = FALSE,
trace = 0L, freeze.topics = FALSE)
slda.em(documents, K, vocab, num.e.iterations, num.m.iterations, alpha,
eta, annotations, params, variance, logistic = FALSE, lambda = 10,
regularise = FALSE, method = "sLDA", trace = 0L, MaxNWts=3000,
initial = NULL)
mmsb.collapsed.gibbs.sampler(network, K, num.iterations, alpha,
beta.prior, initial = NULL, burnin = NULL, trace = 0L)
lda.cvb0(documents, K, vocab, num.iterations, alpha, eta, trace = 0L)
Arguments
documents |
A list whose length is equal to the number of documents, D. Each
element of documents is an integer matrix with two rows. Each
column of documents[[i]] (i.e., document i) represents a word occurring in the document. documents[[i]][1, j] is a 0-indexed word identifier for the jth word in document i. That is, this should be an index - 1 into vocab. documents[[i]][2, j] is an integer specifying the number of times that word appears in the document. |
network |
For |
K |
An integer representing the number of topics in the model. |
vocab |
A character vector specifying the vocabulary words associated with the word indices used in documents. |
num.iterations |
The number of sweeps of Gibbs sampling over the entire corpus to make. |
num.e.iterations |
For |
num.m.iterations |
For |
alpha |
The scalar value of the Dirichlet hyperparameter for topic proportions. |
beta.prior |
For |
eta |
The scalar value of the Dirichlet hyperparameter for topic multinomials. |
initial |
A list of initial topic assignments for words. It should be in the same format as the assignments field of the return value. If this field is NULL, then the sampler will be initialized with random assignments. |
burnin |
A scalar integer indicating the number of Gibbs sweeps to consider
as burn-in (i.e., throw away) for |
compute.log.likelihood |
A scalar logical which when |
annotations |
A length D numeric vector of covariates associated with each
document. Only used by |
params |
For |
variance |
For |
logistic |
For |
lambda |
When regularise is |
regularise |
When |
method |
For |
trace |
When |
MaxNWts |
Passed to nnet's multinom function as the maximum number of weights allowed; the default is 3000. Increasing this value may be necessary when using logistic sLDA with a large number of topics, at the expense of longer run times. |
freeze.topics |
When |
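The layout described for documents can be sketched in base R as follows. The helper to_lda_documents is hypothetical and not part of the lda package (lexicalize, listed under See Also, is the package's own route); it only illustrates the two-row, 0-indexed format. The Note on this page about counts other than 1 applies to matrices built this way.

```r
# Hypothetical helper (not part of the lda package): convert tokenised
# documents into the two-row integer matrices described under `documents`.
to_lda_documents <- function(tokens, vocab) {
  lapply(tokens, function(words) {
    idx <- match(words, vocab)      # 1-based positions in vocab
    idx <- idx[!is.na(idx)]         # drop words that are not in vocab
    counts <- table(idx)            # occurrences of each word index
    rbind(as.integer(names(counts)) - 1L,  # row 1: 0-indexed word ids
          as.integer(counts))              # row 2: counts within the document
  })
}

# Toy input: two documents, three vocabulary words.
tokens <- list(c("a", "b", "a"), c("c", "b"))
vocab <- c("a", "b", "c")
documents <- to_lda_documents(tokens, vocab)
```

Here documents[[1]] has one column per distinct word: "a" (id 0) with count 2 and "b" (id 1) with count 1.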
Value
A fitted model as a list with the following components:
assignments |
A list of length D. Each element of the list, say
|
topics |
A |
topic_sums |
A length K vector where each entry indicates the total number of times words were assigned to each topic. |
document_sums |
A |
log.likelihoods |
Only for |
document_expects |
This field only exists if burnin is non-NULL. This field is like document_sums but instead of only aggregating counts for the last iteration, this field aggregates counts over all iterations after burnin. |
net.assignments.left |
Only for
|
net.assignments.right |
Only for
|
blocks.neg |
Only for
|
blocks.pos |
Only for
|
model |
For |
coefs |
For |
Note
WARNING: This function does not compute precisely the correct thing when the count associated with a word in a document is not 1 (this is for speed reasons currently). A workaround when a word appears multiple times is to replicate the word across several columns of a document. This will likely be fixed in a future version.
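The workaround above might be sketched as follows, in base R. The helper expand_counts is hypothetical and not part of the lda package; it repeats each word identifier once per occurrence, so every column of the result carries a count of 1.

```r
# Hypothetical helper (not part of the lda package): expand a two-row
# document matrix so that each column represents a single occurrence.
expand_counts <- function(doc) {
  ids <- rep(doc[1, ], times = doc[2, ])   # repeat each id `count` times
  rbind(ids, rep(1L, length(ids)), deparse.level = 0)
}

# Word 0 appears twice, word 3 once, word 5 three times.
doc <- rbind(c(0L, 3L, 5L), c(2L, 1L, 3L))
expanded <- expand_counts(doc)
```

Applied to a whole corpus, this would be lapply(documents, expand_counts). The number of columns grows to the total token count of each document.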
Author(s)
Jonathan Chang (slycoder@gmail.com)
References
Blei, David M. and Ng, Andrew and Jordan, Michael. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
Airoldi, Edoardo M. and Blei, David M. and Fienberg, Stephen E. and Xing, Eric P. Mixed Membership Stochastic Blockmodels. Journal of Machine Learning Research, 2008.
Blei, David M. and McAuliffe, John. Supervised topic models. Advances in Neural Information Processing Systems, 2008.
Griffiths, Thomas L. and Steyvers, Mark. Finding scientific topics. Proceedings of the National Academy of Sciences, 2004.
Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W. On smoothing and inference for topic models. Uncertainty in Artificial Intelligence, 2009.
See Also
read.documents and lexicalize can be used to generate the input data to these models.
top.topic.words, predictive.distribution, and slda.predict for operations on the fitted models.
Examples
## See demos for the three functions:
## Not run: demo(lda)
## Not run: demo(slda)
## Not run: demo(mmsb)
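For a quick look at the call itself, a minimal sketch on a toy corpus follows. The corpus, K = 2, the hyperparameter values, and the iteration count are arbitrary assumptions chosen only to keep the example small. The call is guarded so the block still runs when the lda package is not installed.

```r
## Toy corpus in the format described under `documents`:
## row 1 holds 0-indexed word ids, row 2 holds counts (all 1 here).
vocab <- c("apple", "banana", "cherry", "date")
documents <- list(
  rbind(c(0L, 1L, 2L), c(1L, 1L, 1L)),
  rbind(c(1L, 3L), c(1L, 1L)),
  rbind(c(0L, 2L, 3L), c(1L, 1L, 1L))
)

## Fit only if the lda package is available (assumed values throughout).
if (requireNamespace("lda", quietly = TRUE)) {
  set.seed(1)
  fit <- lda::lda.collapsed.gibbs.sampler(documents, K = 2, vocab = vocab,
                                          num.iterations = 50,
                                          alpha = 0.1, eta = 0.1)
  ## Components of the returned list, as named in the Value section.
  print(names(fit))
  print(fit$topic_sums)
}
```

With so few words the estimates are not meaningful; the point is only the shape of the inputs and the names of the returned components.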