calc_prob_coherence {tidylda} | R Documentation |
Probabilistic coherence of topics
Description
Calculates the probabilistic coherence of a topic or topics. This approximates semantic coherence or human understandability of a topic.
Usage
calc_prob_coherence(beta, data, m = 5)
Arguments
beta |
A numeric matrix or a numeric vector. The vector, or each row of the matrix, represents the numeric relationship between topic(s) and terms. For example, this relationship may be p(word|topic) or p(topic|word). |
data |
A document term matrix or term co-occurrence matrix. The preferred
class is a |
m |
An integer giving the number of top words per topic to use in the calculation. Defaults to 5. |
Details
For each pair of words {a, b} in the top m words of a topic, where {a} is more probable than {b} in the topic, probabilistic coherence calculates P(b|a) - P(b). For example, suppose the top 4 words in a topic, in decreasing order of probability, are {a, b, c, d}. Then we calculate
1. P(b|a) - P(b), P(c|a) - P(c), P(d|a) - P(d)
2. P(c|b) - P(c), P(d|b) - P(d)
3. P(d|c) - P(d)
All 6 differences are averaged together.
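The calculation described here can be sketched in base R. This is an illustrative sketch only, not the package's implementation: the toy matrix `dtm`, the vector `topic`, and the helper `sketch_coherence()` are hypothetical names. It assumes probabilities are estimated as document frequencies (the share of documents containing a word, or both words), that m is at least 2, and that every top word appears in at least one document. It follows the P(b|a) - P(b) definition, with {a} ranked above {b}.

```r
# Hypothetical toy data: 6 documents x 4 terms (1 = term occurs in document).
dtm <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    1, 0, 1, 0,
    0, 1, 0, 1,
    1, 1, 0, 0,
    0, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), c("a", "b", "c", "d"))
)

# Hypothetical topic: p(word | topic), named by term.
topic <- c(a = 0.4, b = 0.3, c = 0.2, d = 0.1)

# Illustrative helper (an assumption, not part of tidylda).
sketch_coherence <- function(topic, dtm, m = 5) {
  m <- min(m, length(topic))
  # Top m terms, most probable first.
  top <- names(sort(topic, decreasing = TRUE))[seq_len(m)]
  # Binary indicator: does each document contain each top term?
  x <- (dtm[, top, drop = FALSE] > 0) * 1
  # Joint document frequencies P(a, b); the diagonal holds P(a).
  p_joint <- crossprod(x) / nrow(x)
  p_single <- diag(p_joint)
  diffs <- c()
  for (i in seq_len(m - 1)) {
    for (j in (i + 1):m) {
      # P(lower-ranked | higher-ranked) - P(lower-ranked)
      diffs <- c(diffs, p_joint[i, j] / p_single[i] - p_single[j])
    }
  }
  # Average all pairwise differences.
  mean(diffs)
}

sketch_coherence(topic, dtm, m = 4)
```

A positive value suggests the top words co-occur more often than their individual frequencies would predict; a value near zero or below suggests they do not.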
Value
Returns an object of class numeric corresponding to the probabilistic coherence of the input topic(s).
Examples
# Load the package and a pre-formatted dtm
library(tidylda)
data(nih_sample_dtm)
# fit a model
set.seed(12345)
model <- tidylda(
data = nih_sample_dtm[1:20, ], k = 5,
iterations = 100, burnin = 50
)
# probabilistic coherence of each topic, using its top 5 terms
calc_prob_coherence(beta = model$beta, data = nih_sample_dtm, m = 5)