textmodel_seededlda {seededlda}    R Documentation

Semisupervised Latent Dirichlet allocation

Description

Implements semisupervised Latent Dirichlet allocation (Seeded LDA). textmodel_seededlda() allows users to specify topics using a seed word dictionary. Users can run Seeded Sequential LDA by setting gamma > 0.

Usage

textmodel_seededlda(
  x,
  dictionary,
  levels = 1,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  residual = 0,
  weight = 0.01,
  max_iter = 2000,
  auto_iter = FALSE,
  alpha = 0.5,
  beta = 0.1,
  gamma = 0,
  batch_size = 1,
  ...,
  verbose = quanteda_options("verbose")
)

Arguments

x

the dfm on which the model will be fit.

dictionary

a quanteda::dictionary() with seed words that define topics.

levels

levels of entities in a hierarchical dictionary to be used as seed words. See also quanteda::flatten_dictionary.
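
For example, a sketch with a two-level dictionary, assuming levels = 2 selects the second-level keys as the seeded topics (the keys here are illustrative):

dict_nested <- dictionary(list(
    genre = list(
        space = c("alien", "planet", "space"),
        war = c("war", "soldier*")
    )
))
# use the second-level keys ("space", "war") as seeded topics
lda <- textmodel_seededlda(dfmt, dict_nested, levels = 2)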

valuetype

see quanteda::valuetype.

case_insensitive

logical; if TRUE, ignore case when matching seed words; see quanteda::valuetype.

residual

the number of undefined topics. They are named "other" by default, but the name can be changed via base::options(seededlda_residual_name).
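
A sketch of adding residual topics under a custom name; the option name follows the description above, and the number of topics is illustrative:

options(seededlda_residual_name = "misc")  # rename residual topics from "other"
lda <- textmodel_seededlda(dfmt, dict, residual = 2)  # two undefined topics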

weight

determines the size of the pseudo-counts given to matched seed words.

max_iter

the maximum number of iterations in Gibbs sampling.

auto_iter

if TRUE, stops Gibbs sampling on convergence before reaching max_iter. See details.
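
A sketch of convergence-based stopping; max_iter still caps the number of iterations:

lda <- textmodel_seededlda(dfmt, dict, auto_iter = TRUE, max_iter = 2000)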

alpha

the value to smooth the topic-document distribution.

beta

the value to smooth the topic-word distribution.

gamma

a parameter that controls the change of topics between sentences or paragraphs. When gamma > 0, Gibbs sampling of topics for the current document is affected by the topics of the previous document.
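
A minimal sketch of Seeded Sequential LDA on sentence-level documents, reshaped with quanteda's corpus_reshape(); gamma = 0.5 is an illustrative value:

corp_sent <- corpus_reshape(corp, to = "sentences")
dfmt_sent <- dfm(tokens(corp_sent, remove_punct = TRUE))
lda_seq <- textmodel_seededlda(dfmt_sent, dict, gamma = 0.5)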

batch_size

split the corpus into smaller batches (specified as a proportion of documents) for distributed computing; it is disabled when a single batch includes all the documents (batch_size = 1.0). See details.
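
A sketch of batched fitting; the proportion is illustrative:

# process batches of 1% of the documents each
lda_batch <- textmodel_seededlda(dfmt, dict, batch_size = 0.01)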

...

passed to quanteda::dfm_trim() to restrict seed words based on their term or document frequency. This is useful when glob patterns in the dictionary match too many words.
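
For example, thresholds passed through to quanteda::dfm_trim() (the values here are illustrative):

# keep seed words occurring at least 10 times and in at most 50% of documents
lda <- textmodel_seededlda(dfmt, dict, min_termfreq = 10,
                           max_docfreq = 0.5, docfreq_type = "prop")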

verbose

logical; if TRUE print diagnostic information during fitting.

Value

The same as textmodel_lda() with extra elements for dictionary.
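
A sketch of inspecting the fitted object; theta and phi follow textmodel_lda()'s elements (document-topic and topic-word distributions):

lda <- textmodel_seededlda(dfmt, dict)
terms(lda, 10)   # top 10 terms for each topic
topics(lda)      # most salient topic of each document
lda$dictionary   # the seed word dictionary supplied
lda$theta        # document-topic distribution
lda$phi          # topic-word distribution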

References

Lu, Bin et al. (2011). "Multi-aspect Sentiment Analysis with Topic Models". doi:10.5555/2117693.2119585. Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops.

Watanabe, Kohei & Zhou, Yuan. (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.

Watanabe, Kohei & Baturo, Alexander. (2023). "Seeded Sequential LDA: A Semi-supervised Algorithm for Topic-specific Analysis of Sentences". doi:10.1177/08944393231178605. Social Science Computer Review.

See Also

keyATM

Examples


require(seededlda)
require(quanteda)

corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords("en"), min_nchar = 2) %>%
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")

dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("alien", "planet", "space"),
                        monster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
lda_seed <- textmodel_seededlda(dfmt, dict, residual = TRUE, min_termfreq = 10,
                                max_iter = 500)
terms(lda_seed)
topics(lda_seed)

