tidylda {tidylda} | R Documentation |
Fit a Latent Dirichlet Allocation topic model
Description
Fit a Latent Dirichlet Allocation topic model using collapsed Gibbs sampling.
Usage
tidylda(
data,
k,
iterations = NULL,
burnin = -1,
alpha = 0.1,
eta = 0.05,
optimize_alpha = FALSE,
calc_likelihood = TRUE,
calc_r2 = FALSE,
threads = 1,
return_data = FALSE,
verbose = TRUE,
...
)
Arguments
data |
A document term matrix or term co-occurrence matrix. The preferred
class is a |
k |
Integer number of topics. |
iterations |
Integer number of iterations for the Gibbs sampler to run. |
burnin |
Integer number of burnin iterations. If |
alpha |
Numeric scalar or vector of length |
eta |
Numeric scalar, numeric vector of length |
optimize_alpha |
Logical. Do you want to optimize alpha every iteration?
Defaults to |
calc_likelihood |
Logical. Do you want to calculate the log likelihood every iteration?
Useful for assessing convergence. Defaults to |
calc_r2 |
Logical. Do you want to calculate R-squared after the model is trained?
Defaults to |
threads |
Number of parallel threads, defaults to 1. See Details, below. |
return_data |
Logical. Do you want |
verbose |
Logical. Do you want to print a progress bar out to the console?
Defaults to |
... |
Additional arguments, currently unused |
Details
This function calls a collapsed Gibbs sampler for Latent Dirichlet Allocation written using the excellent Rcpp package. Some implementation notes follow:
Topic-token and topic-document assignments are not initialized based on a
uniform-random sampling, as is common. Instead, topic-token probabilities
(i.e. beta) are initialized by sampling from a Dirichlet distribution
with eta as its parameter. The same is done for topic-document
probabilities (i.e. theta) using alpha. Then an internal
function (initialize_topic_counts) is called to run
a single Gibbs iteration to initialize assignments of tokens to topics and
topics to documents.
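The Dirichlet initialization described above can be sketched in base R. This is only an illustration under stated assumptions: rdirichlet_sketch is a hypothetical helper (not a tidylda function), the dimensions are made up, and the internal initialize_topic_counts step is not reproduced.

```r
# Draw n samples from a Dirichlet distribution by normalizing gamma draws.
# Hypothetical helper for illustration; not part of tidylda.
rdirichlet_sketch <- function(n, conc) {
  k <- length(conc)
  # Column j holds n gamma draws with shape conc[j] (matrix fills by column)
  g <- matrix(rgamma(n * k, shape = rep(conc, each = n)), nrow = n, ncol = k)
  g / rowSums(g)
}

set.seed(42)
n_topics <- 3 # assumed toy sizes
n_tokens <- 6
n_docs <- 4

# beta: one row per topic, each row a distribution over tokens (prior eta)
beta_init <- rdirichlet_sketch(n_topics, rep(0.05, n_tokens))

# theta: one row per document, each row a distribution over topics (prior alpha)
theta_init <- rdirichlet_sketch(n_docs, rep(0.1, n_topics))
```

Each row of beta_init and theta_init sums to one, so both can serve as starting probabilities for a first sampling pass.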
When you use burn-in iterations (i.e. burnin is greater than -1, its
default), the resulting beta and theta matrices are calculated by averaging
over every iteration after the specified number of burn-in iterations. If
you do not use burn-in iterations, then the matrices are calculated from the
last run only. Ideally, you'd burn in every iteration before convergence, then
average over the chain after it has converged (and thus every observation is
independent).
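As a rough illustration of the averaging described above, the following base R sketch uses a made-up chain of per-iteration theta estimates. The names chain and summarize_chain are hypothetical, and the convention that iterations numbered above burnin are kept is an assumption, not a statement about the package internals.

```r
set.seed(1)
iterations <- 10
burnin <- 6

# Hypothetical chain: one 4 x 3 row-normalized matrix per iteration
chain <- lapply(seq_len(iterations), function(i) {
  m <- matrix(runif(4 * 3), nrow = 4)
  m / rowSums(m)
})

# Average post-burn-in iterations, or fall back to the last iteration
summarize_chain <- function(chain, burnin) {
  stopifnot(burnin < length(chain))
  if (burnin > -1) {
    kept <- chain[seq_along(chain) > burnin]
    Reduce(`+`, kept) / length(kept)
  } else {
    chain[[length(chain)]]
  }
}

theta_avg <- summarize_chain(chain, burnin) # mean of iterations 7 to 10
theta_last <- summarize_chain(chain, -1)    # last iteration only
```

Because every matrix in the chain has rows summing to one, the averaged matrix does too.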
If you set optimize_alpha to TRUE, then each element of alpha
is proportional to the number of times each topic has been sampled that
iteration, averaged with the value of alpha from the previous iteration. This
lets you start with a symmetric alpha and drift into an asymmetric one.
However, (a) this probably means that convergence will take longer
or may not happen at all, and (b) I make no guarantees that doing this
will give you any benefit or that it won't hurt your model. Caveat emptor!
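One way to read the alpha update described above is sketched below. The scaling is an assumption: the text only says the new value is proportional to topic counts, so this sketch rescales the counts to the total mass of the previous alpha before averaging. update_alpha_sketch is a hypothetical name, not the package's internal routine.

```r
# Assumed update rule: rescale this iteration's topic counts so they sum to
# sum(alpha_prev), then average element-wise with the previous alpha.
update_alpha_sketch <- function(alpha_prev, topic_counts) {
  proportional <- topic_counts / sum(topic_counts) * sum(alpha_prev)
  (alpha_prev + proportional) / 2
}

alpha <- rep(0.1, 4)             # symmetric starting point
topic_counts <- c(50, 30, 15, 5) # made-up counts of sampled topics

alpha_next <- update_alpha_sketch(alpha, topic_counts)
```

Under this reading, topics sampled more often receive a larger share of alpha while the total stays fixed, which is how a symmetric prior can drift into an asymmetric one.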
The log likelihood calculation is the same as the one found on page 9 of
https://arxiv.org/pdf/1510.08628.pdf. The only difference is that the
version in tidylda allows eta to be a vector or matrix. (A vector is used
in this function; a matrix is used for model updates in refit.tidylda.)
At present, the log likelihood function appears to be adequate for assessing
convergence, i.e. it has the right shape. However, as of this writing, it
returns positive numbers rather than the expected negative numbers. This is
being looked into, but in the meantime caveat emptor once again.
Parallelism is not currently implemented. The threads argument is a
placeholder for planned enhancements.
Value
Returns an S3 object of class tidylda. See new_tidylda.
Examples
# load some data
data(nih_sample_dtm)
# fit a model
set.seed(12345)
m <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 200, burnin = 175
)
str(m)
# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  iterations = 200, burnin = 175
)
# predict on held-out documents using the dot product method
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")
# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))