textmodel_seededlda {seededlda}    R Documentation

Semisupervised Latent Dirichlet allocation

Description

Implements semisupervised Latent Dirichlet allocation (Seeded LDA). textmodel_seededlda() allows users to specify topics using a seed word dictionary. Users can run Seeded Sequential LDA by setting gamma > 0.

Usage

textmodel_seededlda(
  x,
  dictionary,
  levels = 1,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  residual = 0,
  weight = 0.01,
  max_iter = 2000,
  auto_iter = FALSE,
  alpha = 0.5,
  beta = 0.1,
  gamma = 0,
  batch_size = 1,
  ...,
  verbose = quanteda_options("verbose")
)

Arguments

x

the dfm on which the model will be fit.

dictionary

a quanteda::dictionary() with seed words that define topics.

levels

levels of entities in a hierarchical dictionary to be used as seed words. See also quanteda::flatten_dictionary().

valuetype

see quanteda::valuetype

case_insensitive

see quanteda::valuetype

residual

the number of undefined topics. They are named "other" by default, but the name can be changed via base::options(seededlda_residual_name); see the Examples for a sketch.

weight

determines the size of pseudo counts given to matched seed words.

max_iter

the maximum number of iterations in Gibbs sampling.

auto_iter

if TRUE, stops Gibbs sampling on convergence before reaching max_iter. See details.

alpha

the value to smooth the topic-document distribution.

beta

the value to smooth the topic-word distribution.

gamma

a parameter that determines the change of topics between sentences or paragraphs. When gamma > 0, Gibbs sampling of topics for the current document is affected by the previous document's topics; this enables Seeded Sequential LDA (see the Examples for a sketch).

batch_size

split the corpus into smaller batches (specified as a proportion) for distributed computing; batching is disabled when a batch includes all the documents (batch_size = 1.0). See details.

...

passed to quanteda::dfm_trim() to restrict seed words based on their term or document frequency. This is useful when glob patterns in the dictionary match too many words.

verbose

logical; if TRUE, print diagnostic information during fitting.

References

Lu, Bin et al. (2011). "Multi-aspect Sentiment Analysis with Topic Models". doi:10.5555/2117693.2119585. Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops.

Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.

Watanabe, Kohei & Baturo, Alexander. (2023). "Seeded Sequential LDA: A Semi-supervised Algorithm for Topic-specific Analysis of Sentences". doi:10.1177/08944393231178605. Social Science Computer Review.

See Also

keyATM

Examples


require(seededlda)
require(quanteda)

corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords("en"), min_nchar = 2) %>%
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")

dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("alien", "planet", "space"),
                        monster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
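
# A minimal sketch, assuming the option named under 'residual' above: the
# residual topic is labeled "other" by default; the label "misc" below is
# only an illustrative choice.
options(seededlda_residual_name = "misc")
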
lda_seed <- textmodel_seededlda(dfmt, dict, residual = 1, min_termfreq = 10,
                                max_iter = 500)
terms(lda_seed)
topics(lda_seed)
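
# A hedged sketch of Seeded Sequential LDA (Watanabe & Baturo 2023): reshape
# the corpus to sentences and set gamma > 0 so that topic assignments carry
# over between adjacent sentences. gamma = 0.2 is an illustrative value, not
# a recommendation from this page.
corp_sent <- corpus_reshape(corp, to = "sentences")
toks_sent <- tokens(corp_sent, remove_punct = TRUE, remove_symbols = TRUE,
                    remove_numbers = TRUE)
dfmt_sent <- dfm(toks_sent) %>%
    dfm_remove(stopwords("en"), min_nchar = 2)
lda_seq <- textmodel_seededlda(dfmt_sent, dict, gamma = 0.2,
                               min_termfreq = 10, max_iter = 500)
terms(lda_seq)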

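# Hedged sketch: auto_iter = TRUE stops Gibbs sampling on convergence before
# max_iter is reached, and batch_size < 1.0 splits the corpus into batches
# (here halves) for distributed sampling. Both values are illustrative.
lda_auto <- textmodel_seededlda(dfmt, dict, residual = 1, min_termfreq = 10,
                                max_iter = 2000, auto_iter = TRUE,
                                batch_size = 0.5)
terms(lda_auto)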
