mallet_tidiers {tidytext} | R Documentation |
Tidiers for Latent Dirichlet Allocation models from the mallet package
Description
Tidy LDA models fit by the mallet package, which wraps the Mallet topic
modeling package in Java. The arguments and return values
are similar to lda_tidiers().
Usage
## S3 method for class 'jobjRef'
tidy(
  x,
  matrix = c("beta", "gamma"),
  log = FALSE,
  normalized = TRUE,
  smoothed = TRUE,
  ...
)
## S3 method for class 'jobjRef'
augment(x, data, ...)
Arguments
x |
A jobjRef object, of type RTopicModel, such as created
by |
matrix |
Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix. |
log |
Whether beta/gamma should be on a log scale, default FALSE |
normalized |
If true (default), normalize so that each document or word sums to one across the topics. If false, values will be integers representing the actual number of word-topic or document-topic assignments. |
smoothed |
If true (default), add the smoothing parameter to each value
so that no value is zero. This smoothing parameter is initialized
as |
... |
Extra arguments, not used |
data |
For |
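As a rough illustration of how the log, normalized and smoothed arguments relate to each other, the base R sketch below applies the same three steps to a small document-by-topic count matrix. The counts and the smoothing value of 0.5 are hypothetical and chosen only for illustration; this is not the package's implementation, which reads its values from the fitted Mallet model.

```r
# Hypothetical document-by-topic assignment counts: 2 documents, 3 topics.
counts <- matrix(
  c(4, 0, 6,
    1, 5, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(document = c("d1", "d2"), topic = c("1", "2", "3"))
)

# Assumed smoothing value, chosen only for illustration.
smoothing <- 0.5

# smoothed = TRUE: add the smoothing value so that no entry is zero.
smoothed <- counts + smoothing

# normalized = TRUE: rescale so each document sums to one across topics.
gamma <- smoothed / rowSums(smoothed)

# log = TRUE: report the values on a log scale instead.
log_gamma <- log(gamma)

# Long "tidy" layout: one row per document-topic pair.
tidy_gamma <- as.data.frame(as.table(gamma), responseName = "gamma")
```

With normalized = FALSE and smoothed = FALSE the values would simply be the raw counts, including the zeros.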
Details
Note that the LDA models from mallet::MalletLDA()
are technically a special case of S4 objects with class jobjRef.
These are thus implemented as jobjRef tidiers, with a check for
whether the toString output is as expected.
Value
augment must be provided a data argument containing
one row per original document-term pair, such as is returned by
tdm_tidiers, containing columns document and term.
It returns that same data with an additional column .topic
with the topic assignment for that document-term combination.
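The shape of that return value can be sketched in base R without a fitted model. Everything below is hypothetical: the document-term pairs, the topic numbers, and the use of merge() to attach them. It only shows what the resulting table looks like, not how augment derives the assignments.

```r
# Hypothetical input: one row per document-term pair, as described above.
data <- data.frame(
  document = c("d1", "d1", "d2"),
  term     = c("cat", "dog", "cat"),
  count    = c(2, 1, 3)
)

# Hypothetical topic assignments for the same document-term pairs.
assignments <- data.frame(
  document = c("d1", "d1", "d2"),
  term     = c("cat", "dog", "cat"),
  .topic   = c(1L, 2L, 1L)
)

# The result keeps the original columns and gains a .topic column.
augmented <- merge(data, assignments, by = c("document", "term"))
augmented
```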
See Also
lda_tidiers(), mallet::mallet.doc.topics(),
mallet::mallet.topic.words()
Examples
## Not run:
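library(mallet)
library(dplyr)
library(tidytext)  # provides tidy() methods and the stop_words data
library(ggplot2)   # needed for the plot below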
data("AssociatedPress", package = "topicmodels")
td <- tidy(AssociatedPress)
# mallet needs a file with stop words
tmp <- tempfile()
writeLines(stop_words$word, tmp)
# two vectors: one with document IDs, one with text
docs <- td %>%
  group_by(document = as.character(document)) %>%
  summarize(text = paste(rep(term, count), collapse = " "))
docs <- mallet.import(docs$document, docs$text, tmp)
# create and run a topic model
topic_model <- MalletLDA(num.topics = 4)
topic_model$loadDocuments(docs)
topic_model$train(20)
# tidy the word-topic combinations
td_beta <- tidy(topic_model)
td_beta
# Examine the four topics
td_beta %>%
  group_by(topic) %>%
  top_n(8, beta) %>%
  ungroup() %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta)) +
  geom_col() +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
# find the assignments of each word in each document
assignments <- augment(topic_model, td)
assignments
## End(Not run)