refit.tidylda {tidylda} | R Documentation
Update a Latent Dirichlet Allocation topic model
Description
Update an LDA model using collapsed Gibbs sampling.
Usage
## S3 method for class 'tidylda'
refit(
  object,
  new_data,
  iterations = NULL,
  burnin = -1,
  prior_weight = 1,
  additional_k = 0,
  additional_eta_sum = 250,
  optimize_alpha = FALSE,
  calc_likelihood = FALSE,
  calc_r2 = FALSE,
  return_data = FALSE,
  threads = 1,
  verbose = TRUE,
  ...
)
Arguments
object
a fitted object of class tidylda.

new_data
A document term matrix or term co-occurrence matrix of class dgCMatrix.

iterations
Integer number of iterations for the Gibbs sampler to run.

burnin
Integer number of burnin iterations. If burnin is greater than -1, the resulting posterior estimates are averaged over all iterations after the burn-in period. Defaults to -1.

prior_weight
Numeric, 0 or greater, or NA. The weight given to the base model's beta when constructing the eta prior for the new model. If NA, the old model initializes the sampler but is not used as a prior. See Details. Defaults to 1.

additional_k
Integer number of topics to add, defaults to 0.

additional_eta_sum
Numeric magnitude of the prior for additional topics. Ignored if additional_k is 0. Defaults to 250.

optimize_alpha
Logical. Experimental. Do you want to optimize alpha every iteration? Defaults to FALSE.

calc_likelihood
Logical. Do you want to calculate the log likelihood every iteration? Useful for assessing convergence. Defaults to FALSE.

calc_r2
Logical. Do you want to calculate R-squared after the model is trained? Defaults to FALSE.

return_data
Logical. Do you want new_data returned as part of the model object? Defaults to FALSE.

threads
Number of parallel threads, defaults to 1.

...
Additional arguments, currently unused.
Details
refit allows you to (a) update the probabilities (i.e. weights) of a previously-fit model with new data or additional iterations and (b) optionally use the beta of a previously-fit LDA topic model as the eta prior for the new model. This is tuned through the prior_weight argument: a numeric value uses the old beta as a prior, while prior_weight = NA does not.
prior_weight tunes how strongly the base model is represented in the prior. If prior_weight = 1, then the tokens from the base model's training data have the same relative weight as tokens in new_data. In other words, it is like just adding training data. If prior_weight is less than 1, then tokens in new_data are given more weight. If prior_weight is greater than 1, then the tokens from the base model's training data are given more weight.

If prior_weight is NA, then the new eta is equal to eta from the old model, with new tokens folded in. (For handling of new tokens, see below.) Effectively, this just controls how the sampler initializes (described below), but does not give prior weight to the base model.
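As a rough numeric illustration of the weighting described above (my own interpretation, not tidylda's internals): with prior_weight = 1, the base model contributes prior pseudo-counts on the order of its original token count, spread across the vocabulary according to beta, and that contribution scales linearly with prior_weight.

```r
# Conceptual sketch only; variable names are illustrative, not tidylda internals.
n_old_tokens <- 1000            # tokens in the base model's training data
prior_weight <- 1               # same relative weight as tokens in new_data
beta_row <- c(0.7, 0.2, 0.1)    # one topic's token distribution from the old model

# Prior pseudo-counts contributed by the base model for this topic:
pseudo_counts <- prior_weight * n_old_tokens * beta_row
sum(pseudo_counts)              # total prior mass scales linearly with prior_weight
```

Doubling prior_weight under this reading doubles the pseudo-counts, which matches the statement that values above 1 favor the base model's training data.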
Instead of initializing token-topic assignments in the manner for new models (see tidylda), the update initializes in two steps:

First, topic-document probabilities (i.e. theta) are obtained by a call to predict.tidylda using method = "dot" for the documents in new_data. Next, both beta and theta are passed to an internal function, initialize_topic_counts, which assigns topics to tokens in a manner approximately proportional to the posteriors and executes a single Gibbs iteration.
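The token-topic assignment step above can be sketched in base R: a topic for each token is drawn with probability proportional to theta[d, k] * beta[k, w]. The names below are illustrative; this is not tidylda's initialize_topic_counts.

```r
# Minimal sketch: assign topics to tokens approximately proportional to the
# posterior theta[d, k] * beta[k, w]. Illustrative values, not package internals.
set.seed(42)
k <- 3; v <- 5                     # 3 topics, 5 vocabulary tokens
theta_d <- c(0.6, 0.3, 0.1)        # topic probabilities for one document
beta <- matrix(runif(k * v), k, v) # topic-token weights
beta <- beta / rowSums(beta)       # normalize each topic's row to sum to 1

doc_tokens <- c(1, 2, 2, 5)        # token indices observed in the document
z <- vapply(doc_tokens, function(w) {
  p <- theta_d * beta[, w]         # unnormalized posterior over topics
  sample(k, 1, prob = p)           # draw one topic assignment
}, integer(1))
z                                  # one topic id per token
```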
refit handles the addition of new vocabulary by adding a flat prior over new tokens. Specifically, each entry in the new prior is equal to the 10th percentile of eta from the old model. The resulting model will have the total vocabulary of the old model plus any new vocabulary tokens. In other words, after running refit.tidylda, ncol(beta) >= ncol(new_data), where beta is from the new model and new_data is the additional data.
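The new-vocabulary rule above can be sketched directly: each unseen token gets a flat prior entry equal to the 10th percentile of the old model's eta. Names and values here are illustrative, not package internals.

```r
# Sketch of extending the eta prior to new vocabulary; illustrative only.
old_eta <- matrix(c(0.05, 0.1, 0.2,
                    0.05, 0.3, 0.4), nrow = 2, byrow = TRUE)  # 2 topics x 3 old tokens

new_token_prior <- quantile(old_eta, probs = 0.10)  # flat value for unseen tokens
n_new <- 2                                          # suppose 2 unseen tokens

new_eta <- cbind(old_eta, matrix(new_token_prior, nrow(old_eta), n_new))
ncol(new_eta)  # 5: old vocabulary plus new tokens
```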
You can add additional topics by setting the additional_k parameter to an integer greater than zero. New entries to alpha have a flat prior equal to the median value of alpha in the old model. (Note that if alpha itself is a flat prior, i.e. scalar, then the new topics have the same value for their prior.) New entries to eta have a shape taken from the average of all previous topics in eta, scaled by additional_eta_sum.
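The additional-topic priors can likewise be sketched: new alpha entries take the median of the old alpha, and each new eta row takes the shape of the average old topic, rescaled to sum to additional_eta_sum. This mirrors the text above; names are illustrative, not tidylda internals.

```r
# Sketch of priors for added topics; illustrative values only.
old_alpha <- c(0.1, 0.2, 0.4)             # prior over 3 existing topics
old_eta <- matrix(1:6, nrow = 3)          # 3 topics x 2 tokens
additional_k <- 2
additional_eta_sum <- 250

# New topics' alpha entries: flat at the median of the old alpha
new_alpha <- c(old_alpha, rep(median(old_alpha), additional_k))

# New topics' eta rows: shape of the average old topic, scaled to the given sum
topic_shape <- colMeans(old_eta)
new_row <- topic_shape / sum(topic_shape) * additional_eta_sum
new_eta <- rbind(old_eta, matrix(new_row, additional_k, ncol(old_eta), byrow = TRUE))
rowSums(new_eta)[4:5]                     # each new row sums to additional_eta_sum
```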
Value
Returns an S3 object of class c("tidylda").
Note
Updates are, as of this writing, almost-surely useful, but their behaviors have not been optimized or well-studied. Caveat emptor!
Examples
# load a document term matrix
data(nih_sample_dtm)
d1 <- nih_sample_dtm[1:50, ]
d2 <- nih_sample_dtm[51:100, ]
# fit a model
m <- tidylda(d1,
k = 10,
iterations = 200, burnin = 175
)
# update an existing model by adding documents using old model as prior
m2 <- refit(
object = m,
new_data = rbind(d1, d2),
iterations = 200,
burnin = 175,
prior_weight = 1
)
# use an old model to initialize new model and not use old model as prior
m3 <- refit(
object = m,
new_data = d2, # new documents only
iterations = 200,
burnin = 175,
prior_weight = NA
)
# add topics while updating a model by adding documents
m4 <- refit(
object = m,
new_data = rbind(d1, d2),
additional_k = 3,
iterations = 200,
burnin = 175
)