textmodel_seededlda {seededlda}    R Documentation
Semisupervised Latent Dirichlet allocation
Description
Implements semisupervised Latent Dirichlet allocation (Seeded LDA). textmodel_seededlda() allows users to specify topics using a seed word dictionary. Users can run Seeded Sequential LDA by setting gamma > 0.
Usage
textmodel_seededlda(
x,
dictionary,
levels = 1,
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
residual = 0,
weight = 0.01,
max_iter = 2000,
auto_iter = FALSE,
alpha = 0.5,
beta = 0.1,
gamma = 0,
batch_size = 1,
...,
verbose = quanteda_options("verbose")
)
Arguments

x: the dfm on which the model will be fit.

dictionary: a quanteda::dictionary() with seed words that define the topics.

levels: levels of entities in a hierarchical dictionary to be used as seed words. See also quanteda::flatten_dictionary.

valuetype: the type of pattern matching for seed words: "glob" for glob-style wildcards, "regex" for regular expressions, or "fixed" for exact matching.

case_insensitive: logical; if TRUE, ignore case when matching seed words.

residual: the number of undefined topics. They are named "other" by default, but the name can be changed via base::options(slda_residual_name).

weight: determines the size of pseudo-counts given to matched seed words.

max_iter: the maximum number of iterations in Gibbs sampling.

auto_iter: if TRUE, stops Gibbs sampling on convergence before reaching max_iter.

alpha: the value used to smooth the topic-document distribution.

beta: the value used to smooth the topic-word distribution.

gamma: a parameter to determine the change of topics between sentences or paragraphs. When gamma > 0, topic assignments for a document are affected by the topics of the previous document.

batch_size: split the corpus into smaller batches (specified as a proportion) for distributed computing; disabled when a batch includes all the documents.

...: passed to quanteda::dfm_trim() to restrict seed words based on their term or document frequency. This is useful when glob patterns in the dictionary match too many words.

verbose: logical; if TRUE, print diagnostic messages.
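As a hedged sketch of gamma in use: Seeded Sequential LDA operates on sentence-level documents, so the corpus is reshaped to sentences before fitting. The corpus, dictionary, and preprocessing below are illustrative assumptions following the same workflow as the Examples, not a prescribed recipe.

```r
require(seededlda)
require(quanteda)

# reshape documents into sentences; Seeded Sequential LDA assumes
# sentence- or paragraph-level units (illustrative choice)
corp_sent <- corpus_reshape(data_corpus_moviereviews, to = "sentences")
toks_sent <- tokens(corp_sent, remove_punct = TRUE)
dfmt_sent <- dfm(toks_sent) %>%
    dfm_remove(stopwords("en"))

# a small illustrative seed dictionary
dict <- dictionary(list(space = c("alien*", "planet*"),
                        crime = c("crime*", "murder*")))

# gamma > 0 makes topic sampling for each sentence depend on the
# topics of the preceding sentence
lda_seq <- textmodel_seededlda(dfmt_sent, dict, gamma = 0.5, max_iter = 500)
topics(lda_seq)
```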
Value

The same as textmodel_lda(), with an extra element for the dictionary.
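For illustration, assuming lda is a fitted model object as produced in the Examples below, the stored dictionary can be inspected alongside the usual textmodel_lda() elements:

```r
# lda is assumed to be a fitted textmodel_seededlda object
lda$dictionary  # the seed word dictionary used to fit the model
terms(lda)      # top terms per topic, as for textmodel_lda()
```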
References
Lu, Bin et al. (2011). "Multi-aspect Sentiment Analysis with Topic Models". doi:10.5555/2117693.2119585. Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops.
Watanabe, Kohei & Zhou, Yuan. (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.
Watanabe, Kohei & Baturo, Alexander. (2023). "Seeded Sequential LDA: A Semi-supervised Algorithm for Topic-specific Analysis of Sentences". doi:10.1177/08944393231178605. Social Science Computer Review.
See Also
Examples
require(seededlda)
require(quanteda)
corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
dfm_remove(stopwords("en"), min_nchar = 2) %>%
dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("alien", "planet", "space"),
                        monster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
lda_seed <- textmodel_seededlda(dfmt, dict, residual = TRUE, min_termfreq = 10,
max_iter = 500)
terms(lda_seed)
topics(lda_seed)
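A hedged sketch of auto_iter, reusing dfmt and dict from above: with auto_iter = TRUE, max_iter acts only as an upper bound, and Gibbs sampling may stop earlier on convergence.

```r
# stop Gibbs sampling early on convergence; max_iter is only a cap
lda_auto <- textmodel_seededlda(dfmt, dict, residual = TRUE,
                                auto_iter = TRUE, max_iter = 2000,
                                verbose = TRUE)
terms(lda_auto)
```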