tidylda {tidylda}

R Documentation

Fit a Latent Dirichlet Allocation topic model

Description

Fit a Latent Dirichlet Allocation topic model using collapsed Gibbs sampling.

Usage

tidylda(
  data,
  k,
  iterations = NULL,
  burnin = -1,
  alpha = 0.1,
  eta = 0.05,
  optimize_alpha = FALSE,
  calc_likelihood = TRUE,
  calc_r2 = FALSE,
  threads = 1,
  return_data = FALSE,
  verbose = TRUE,
  ...
)

Arguments

`data`	A document term matrix or term co-occurrence matrix. The preferred class is a `dgCMatrix-class`. However, there is support for any `Matrix-class` object as well as several other commonly used classes such as `matrix`, `dfm`, `DocumentTermMatrix`, and `simple_triplet_matrix`.
`k`	Integer number of topics.
`iterations`	Integer number of iterations for the Gibbs sampler to run.
`burnin`	Integer number of burnin iterations. If `burnin` is greater than -1, the resulting "beta" and "theta" matrices are an average over all iterations greater than `burnin`.
`alpha`	Numeric scalar or vector of length `k`. This is the prior for topics over documents.
`eta`	Numeric scalar, numeric vector of length `ncol(data)`, or numeric matrix with `k` rows and `ncol(data)` columns. This is the prior for words over topics.
`optimize_alpha`	Logical. Do you want to optimize alpha every iteration? Defaults to `FALSE`. See 'details' below for more information.
`calc_likelihood`	Logical. Do you want to calculate the log likelihood every iteration? Useful for assessing convergence. Defaults to `TRUE`.
`calc_r2`	Logical. Do you want to calculate R-squared after the model is trained? Defaults to `FALSE`. See `calc_lda_r2`.
`threads`	Number of parallel threads, defaults to 1. See Details, below.
`return_data`	Logical. Do you want `data` returned as part of the model object?
`verbose`	Logical. Do you want to print a progress bar out to the console? Defaults to `TRUE`.
`...`	Additional arguments, currently unused.

Details

This function calls a collapsed Gibbs sampler for Latent Dirichlet Allocation written using the excellent Rcpp package. Some implementation notes follow:

Topic-token and topic-document assignments are not initialized based on a uniform-random sampling, as is common. Instead, topic-token probabilities (i.e. beta) are initialized by sampling from a Dirichlet distribution with eta as its parameter. The same is done for topic-document probabilities (i.e. theta) using alpha. Then an internal function is called (initialize_topic_counts) to run a single Gibbs iteration to initialize assignments of tokens to topics and topics to documents.
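The Dirichlet initialization described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's C++ code: rdirichlet_sketch is a hypothetical helper, and the corpus dimensions below are made up. It draws from a Dirichlet distribution by normalizing independent gamma draws.

```r
# Hypothetical helper (not part of tidylda): draw n samples from a
# Dirichlet distribution with concentration vector `conc` by normalizing
# independent gamma draws.
rdirichlet_sketch <- function(n, conc) {
  draws <- matrix(
    rgamma(n * length(conc), shape = rep(conc, each = n)),
    nrow = n
  )
  draws / rowSums(draws)
}

set.seed(42)
k <- 3         # number of topics
n_docs <- 4    # illustrative number of documents
n_tokens <- 6  # illustrative vocabulary size

alpha <- rep(0.1, k)         # prior for topics over documents
eta <- rep(0.05, n_tokens)   # prior for words over topics

# One row per topic, one column per token
beta_init <- rdirichlet_sketch(k, eta)

# One row per document, one column per topic
theta_init <- rdirichlet_sketch(n_docs, alpha)
```

Each row of beta_init and theta_init is a probability vector. In the package, matrices like these would then seed a single Gibbs iteration that assigns tokens to topics.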

When you use burn-in iterations (i.e. burnin is greater than -1), the resulting beta and theta matrices are calculated by averaging over every iteration after the specified number of burn-in iterations. If you do not use burn-in iterations, then the matrices are calculated from the last iteration only. Ideally, you'd burn in every iteration before convergence, then average over the chain after it has converged (and thus every observation is independent).
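As a rough sketch of the averaging rule, the base R snippet below uses simulated per-iteration estimates for a single document over three topics. The draws matrix is made-up data, and the 1-based iteration indexing is an assumption; the package's internal bookkeeping may differ.

```r
set.seed(1)
iterations <- 200
burnin <- 175

# Simulated stand-in for per-iteration theta estimates of one document:
# one row per iteration, one column per topic, each row summing to 1.
draws <- matrix(runif(iterations * 3), nrow = iterations)
draws <- draws / rowSums(draws)

if (burnin > -1) {
  # Average over every iteration after the burn-in period
  keep <- seq_len(iterations) > burnin
  estimate <- colMeans(draws[keep, , drop = FALSE])
} else {
  # No burn-in: use the final iteration only
  estimate <- draws[iterations, ]
}
```

With iterations = 200 and burnin = 175, the estimate averages the last 25 iterations. Because every row sums to one, the averaged estimate does too.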

If you set optimize_alpha to TRUE, then each element of alpha is proportional to the number of times each topic has been sampled that iteration, averaged with the value of alpha from the previous iteration. This lets you start with a symmetric alpha and drift into an asymmetric one. However, (a) this probably means that convergence will take longer or may not happen at all, and (b) I make no guarantees that doing this will give you any benefit or that it won't hurt your model. Caveat emptor!
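One possible reading of this update rule is sketched below in base R. The function update_alpha_sketch and the topic counts are hypothetical, and the rescaling that keeps the total mass of alpha constant is an assumption about what "proportional" means here; the package's actual computation may differ.

```r
# Hypothetical helper (not part of tidylda): one reading of the alpha update.
# `topic_counts` holds how many times each topic was sampled this iteration.
update_alpha_sketch <- function(alpha, topic_counts) {
  # Rescale the counts so they carry the same total mass as alpha (assumption)
  proportional <- topic_counts / sum(topic_counts) * sum(alpha)
  # Average with the previous iteration's alpha
  (proportional + alpha) / 2
}

alpha <- rep(0.1, 4)              # symmetric starting prior
topic_counts <- c(50, 30, 15, 5)  # made-up counts for one iteration

alpha_new <- update_alpha_sketch(alpha, topic_counts)
```

Under this reading, a symmetric alpha drifts toward the topics that were sampled most often, while its sum stays the same from one iteration to the next.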

The log likelihood calculation is the same as the one found on page 9 of https://arxiv.org/pdf/1510.08628.pdf. The only difference is that the version in tidylda allows eta to be a vector or matrix. (A vector is used in this function; a matrix is used for model updates in refit.tidylda.) At present, the log likelihood function appears to be OK for assessing convergence, i.e. it has the right shape. However, as of this writing, it returns positive numbers rather than the expected negative numbers. I am looking into that, but in the meantime, caveat emptor once again.

Parallelism is not currently implemented. The threads argument is a placeholder for planned enhancements.

Value

Returns an S3 object of class tidylda. See new_tidylda.

Examples

# load some data
data(nih_sample_dtm)

# fit a model
set.seed(12345)
m <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 200, burnin = 175
)

str(m)

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  iterations = 200, burnin = 175
)

# predict on held-out documents using the dot product method
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")

# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))

[Package tidylda version 0.0.5 Index]