textmodel_lda {seededlda}    R Documentation

Unsupervised Latent Dirichlet allocation

Description

Implements unsupervised Latent Dirichlet allocation (LDA). Users can run Sequential LDA by setting gamma > 0.

Usage

textmodel_lda(
  x,
  k = 10,
  max_iter = 2000,
  auto_iter = FALSE,
  alpha = 0.5,
  beta = 0.1,
  gamma = 0,
  model = NULL,
  batch_size = 1,
  verbose = quanteda_options("verbose")
)

Arguments

x

the dfm on which the model will be fit.

k

the number of topics.

max_iter

the maximum number of iterations in Gibbs sampling.

auto_iter

if TRUE, stops Gibbs sampling on convergence before reaching max_iter. See details.

alpha

the values to smooth the topic-document distribution.

beta

the values to smooth the topic-word distribution.

gamma

a parameter that determines the change of topics between sentences or paragraphs. When gamma > 0, Gibbs sampling of topics for the current document is affected by the topics of the previous document.

model

a fitted LDA model; if provided, textmodel_lda() inherits parameters from an existing model. See details.

batch_size

split the corpus into smaller batches (specified as a proportion of documents) for distributed computing; it is disabled when a batch includes all the documents (batch_size = 1.0). See details.

verbose

logical; if TRUE, print diagnostic information during fitting.

Details

If auto_iter = TRUE, Gibbs sampling stops before reaching max_iter once delta <= 0. delta measures the change in the number of words whose topics are updated by the Gibbs sampler every 100 iterations, as shown in the verbose message.
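
A minimal sketch of this behaviour (dfmt stands for any document-feature matrix and is not defined here):

    lda <- textmodel_lda(dfmt, k = 10, auto_iter = TRUE, max_iter = 2000, verbose = TRUE)
    lda$last_iter  # iteration at which sampling stopped, possibly below max_iter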

If batch_size < 1.0, the corpus is partitioned into sub-corpora of ndoc(x) * batch_size documents, which are processed by Gibbs sampling in sub-processes with synchronization of parameters every 10 iterations. Parallel processing is more efficient when batch_size is small (e.g. 0.01). The algorithm is the Approximate Distributed LDA proposed by Newman et al. (2009). Users can change the number of sub-processes used for parallel computing via options(seededlda_threads).
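
For example, a sketch of distributed sampling on a large dfm (the thread count is arbitrary and dfmt is a placeholder dfm):

    options(seededlda_threads = 4)  # number of sub-processes
    lda <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)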

set.seed() should be called immediately before textmodel_lda() or textmodel_seededlda() to control random topic assignment. If the random number seed is the same, the serial algorithm produces identical results; the parallel algorithm produces non-identical results because it classifies documents in different orders using multiple processors.
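
For example, a sketch of reproducible fitting with the serial algorithm (batch_size = 1):

    set.seed(1234)
    lda1 <- textmodel_lda(dfmt, k = 10)
    set.seed(1234)
    lda2 <- textmodel_lda(dfmt, k = 10)
    identical(lda1$theta, lda2$theta)  # expected to be TRUE for the serial algorithm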

To predict the topics of new documents (i.e. out-of-sample), first create a new LDA model by passing an existing model to the model argument of textmodel_lda(); then apply topics() to the new model. The model argument accepts objects created either by textmodel_lda() or textmodel_seededlda().
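
A sketch of this two-step procedure (dfmt_train and dfmt_test are hypothetical in-sample and out-of-sample dfms):

    lda_train <- textmodel_lda(dfmt_train, k = 10)
    lda_test <- textmodel_lda(dfmt_test, model = lda_train)  # inherits parameters from lda_train
    topics(lda_test)  # topics of the new documents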

Value

Returns a list of model parameters:

k

the number of topics.

last_iter

the number of iterations in Gibbs sampling.

phi

the distribution of words over topics.

theta

the distribution of topics over documents.

words

the raw frequency count of words assigned to topics.

data

the original input of x.

call

the command used to execute the function.

version

the version of the seededlda package.

References

Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed Algorithms for Topic Models. The Journal of Machine Learning Research, 10, 1801–1828.

See Also

LDA, weightedLDA

Examples


require(seededlda)
require(quanteda)

corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords("en"), min_nchar = 2) %>%
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")

lda <- textmodel_lda(dfmt, k = 6, max_iter = 500) # 6 topics
terms(lda)
topics(lda)
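
# Not part of the original example: a sketch of the gamma > 0 setting,
# which links the topics of adjacent documents (e.g. sentences or paragraphs).
lda_seq <- textmodel_lda(dfmt, k = 6, gamma = 0.5, max_iter = 500)
terms(lda_seq)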

