mallet.subset.topic.words {mallet}    R Documentation
Estimate topic-word distributions from a sub-corpus
Description
This function returns a matrix of word probabilities for each topic, similar to
mallet.topic.words, but estimated from a subset of the documents in the corpus.
The model assumes that topics are the same no matter where they are used, but we
know this is often not the case. This function lets us test whether some words
are used more or less than we expect in a particular set of documents.
Usage
mallet.subset.topic.words(
topic.model,
subset.docs,
normalized = FALSE,
smoothed = FALSE
)
Arguments
topic.model
A topic model object created by MalletLDA.

subset.docs
A logical vector of TRUE/FALSE values specifying which documents should be included in the subset.

normalized
If TRUE, normalize the rows so that each topic sums to one. If FALSE, values are the raw number of words of each type assigned to each topic within the subset (possibly plus the smoothing constant).

smoothed
If TRUE, add the smoothing parameter for the model (the initial value specified as beta in MalletLDA). If FALSE, many values will be zero.
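A minimal sketch of how the two flags interact, assuming a fitted topic.model and a logical subset.docs vector as in the Examples below:

# Raw counts: number of words of each type assigned to each topic within the subset
counts <- mallet.subset.topic.words(topic.model, subset.docs,
                                    normalized = FALSE, smoothed = FALSE)

# Smoothed probabilities: each row sums to (approximately) one
probs <- mallet.subset.topic.words(topic.model, subset.docs,
                                   normalized = TRUE, smoothed = TRUE)
rowSums(probs)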
Value
A number-of-topics by vocabulary-size matrix estimated from the included documents only.
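For example, with the ten-topic model fitted in the Examples below, the result has one row per topic and one column per vocabulary word (illustrative sketch):

post1975_topic_words <- mallet.subset.topic.words(topic.model, sotu[["year"]] > 1975)
dim(post1975_topic_words)  # number of topics (10 here) by vocabulary size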
See Also

mallet.topic.words
Examples
## Not run:
# Read in sotu example data
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Create topic model
topic.model <- MalletLDA(num.topics = 10, alpha.sum = 1, beta = 0.1)
topic.model$loadDocuments(sotu.instances)

# Train topic model
topic.model$train(200)

# Extract the topic-word matrix estimated from the post-1975 addresses only
post1975_topic_words <- mallet.subset.topic.words(topic.model, sotu[["year"]] > 1975)

# Top five words for topic 2 within that subset
mallet.top.words(topic.model, word.weights = post1975_topic_words[2, ], num.top.words = 5)
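
# A possible follow-up (illustrative sketch, not part of the original example):
# compare the subset estimate with the full-corpus estimate from
# mallet.topic.words() to see which words topic 2 uses relatively more often
# in the post-1975 addresses than in the corpus as a whole.
all_topic_words <- mallet.topic.words(topic.model, normalized = TRUE, smoothed = TRUE)
post1975_normalized <- mallet.subset.topic.words(topic.model, sotu[["year"]] > 1975,
                                                 normalized = TRUE, smoothed = TRUE)
mallet.top.words(topic.model,
                 word.weights = post1975_normalized[2, ] - all_topic_words[2, ],
                 num.top.words = 5)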
## End(Not run)