lda_tidiers {tidytext} | R Documentation |
Tidiers for LDA and CTM objects from the topicmodels package
Description
Tidy the results of a Latent Dirichlet Allocation or Correlated Topic Model.
Usage
## S3 method for class 'LDA'
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)
## S3 method for class 'CTM'
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)
## S3 method for class 'LDA'
augment(x, data, ...)
## S3 method for class 'CTM'
augment(x, data, ...)
## S3 method for class 'LDA'
glance(x, ...)
## S3 method for class 'CTM'
glance(x, ...)
Arguments
x |
An LDA or CTM (or LDA_VEM/CTM_VEM) object from the topicmodels package |
matrix |
Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix |
log |
Whether beta/gamma should be on a log scale, default FALSE |
... |
Extra arguments, not used |
data |
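For augment, the data given to the LDA or CTM function, either as a DocumentTermMatrix or as a tidied table with "document" and "term" columns |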
Value
tidy returns a tidied version of either the beta or gamma matrix.
If matrix == "beta" (default), returns a table with one row per topic and term, with columns
- topic
Topic, as an integer
- term
Term
- beta
Probability of a term generated from a topic according to the multinomial model
If matrix == "gamma", returns a table with one row per topic and document, with columns
- topic
Topic, as an integer
- document
Document name or ID
- gamma
Probability of topic given document
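The long format described above can be illustrated with a small base R sketch. The matrix, its term names, and the object names (beta_mat, beta_long) are hypothetical and serve only to show the shape of the result; this is not the package's implementation.

```r
# Hypothetical 2-topic x 3-term probability matrix (each row sums to 1)
beta_mat <- matrix(c(0.5, 0.3, 0.2,
                     0.1, 0.1, 0.8),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(NULL, c("oil", "court", "vote")))

# Reshape to one row per topic-term pair, mirroring the columns listed above
beta_long <- data.frame(
  topic = as.integer(row(beta_mat)),
  term  = colnames(beta_mat)[col(beta_mat)],
  beta  = as.vector(beta_mat)
)
beta_long <- beta_long[order(beta_long$topic), ]

# The same values on a log scale, as requested with log = TRUE
beta_long$log_beta <- log(beta_long$beta)

beta_long
```

Within each topic the beta values sum to 1, which is a quick sanity check when reading the tidied table.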
augment returns a table with one row per original document-term pair, such as is returned by tdm_tidiers:
- document
Name of document (if present), or index
- term
Term
- .topic
Topic assignment
If the data argument is provided, any columns in the original data are included, combined based on the document and term columns.
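The idea of combining on the document and term columns can be sketched in base R with merge. Both data frames and their values are hypothetical, and merge stands in for whatever join the package actually performs.

```r
# Hypothetical topic assignments, shaped like the augment output described above
assignments <- data.frame(
  document = c("doc1", "doc1", "doc2"),
  term     = c("oil", "vote", "court"),
  .topic   = c(1L, 2L, 2L)
)

# Hypothetical original data carrying an extra column (count)
original <- data.frame(
  document = c("doc1", "doc1", "doc2"),
  term     = c("oil", "vote", "court"),
  count    = c(3, 1, 2)
)

# Combine on the document and term columns so the extra column is kept
combined <- merge(original, assignments, by = c("document", "term"))
combined
```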
glance always returns a one-row table, with columns
- iter
Number of iterations used
- terms
Number of terms in the model
- alpha
If an LDA_VEM, the parameter of the Dirichlet distribution for topics over documents
Examples
if (requireNamespace("topicmodels", quietly = TRUE)) {
  set.seed(2016)
  library(dplyr)
  library(tidytext)
  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")
  ap <- AssociatedPress[1:100, ]
  lda <- LDA(ap, control = list(alpha = 0.1), k = 4)

  # get term distribution within each topic
  td_lda <- tidy(lda)
  td_lda

  library(ggplot2)

  # visualize the top terms within each topic
  td_lda_filtered <- td_lda %>%
    filter(beta > .004) %>%
    mutate(term = reorder(term, beta))

  ggplot(td_lda_filtered, aes(term, beta)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ topic, scales = "free") +
    theme(axis.text.x = element_text(angle = 90, size = 15))

  # get classification of each document
  td_lda_docs <- tidy(lda, matrix = "gamma")
  td_lda_docs

  # keep the most probable topic for each document
  doc_classes <- td_lda_docs %>%
    group_by(document) %>%
    top_n(1, gamma) %>%
    ungroup()
  doc_classes

  # which were we most uncertain about?
  doc_classes %>%
    arrange(gamma)
}
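The examples above exercise only tidy. A minimal sketch of the other two verbs, refitting the same model, might look like the following. It assumes that both topicmodels and tidytext are installed and that augment and glance are available once tidytext is attached; the names assignments and model_summary are illustrative.

```r
if (requireNamespace("topicmodels", quietly = TRUE) &&
    requireNamespace("tidytext", quietly = TRUE)) {
  library(tidytext)
  library(topicmodels)

  set.seed(2016)
  data("AssociatedPress", package = "topicmodels")
  ap <- AssociatedPress[1:100, ]
  lda <- LDA(ap, control = list(alpha = 0.1), k = 4)

  # one row per document-term pair, with the assigned topic in .topic
  assignments <- augment(lda, data = ap)
  print(head(assignments))

  # one-row summary of the fitted model
  model_summary <- glance(lda)
  print(model_summary)
}
```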