calc_prob_coherence {tidylda}    R Documentation

Probabilistic coherence of topics


Calculates the probabilistic coherence of a topic or topics. This approximates semantic coherence or human understandability of a topic.


calc_prob_coherence(beta, data, m = 5)



beta: A numeric matrix or a numeric vector. The vector, or the rows of the matrix, represent the numeric relationship between topic(s) and terms. For example, this relationship may be p(word|topic) or p(topic|word).


data: A document term matrix or term co-occurrence matrix. The preferred class is a dgCMatrix-class. However, there is support for any Matrix-class object as well as several other commonly used classes such as matrix, dfm, DocumentTermMatrix, and simple_triplet_matrix.


m: An integer for the number of words to be used in the calculation. Defaults to 5.


For each pair of words {a, b} in the top m words in a topic, probabilistic coherence calculates P(b|a) - P(b), where {a} is more probable than {b} in the topic. For example, suppose the top 4 words in a topic are {a, b, c, d}, ordered from most to least probable. Then, we calculate

1. P(b|a) - P(b), P(c|a) - P(c), P(d|a) - P(d)
2. P(c|b) - P(c), P(d|b) - P(d)
3. P(d|c) - P(d)

All 6 differences are averaged together.
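The pairwise calculation can be sketched in base R on a toy example. This is an illustration only, not the package's internal implementation: the function name sketch_coherence, the toy matrix, and the topic vector are all hypothetical, and the sketch assumes that probabilities are estimated as the share of documents containing a word (or both words of a pair) and that all pairs are weighted equally in the average.

```r
# Hypothetical toy document term matrix: 6 documents, 4 terms
dtm <- matrix(
  c(1, 1, 0, 0,
    2, 1, 1, 0,
    1, 0, 1, 0,
    0, 1, 0, 1,
    1, 1, 0, 0,
    0, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), c("a", "b", "c", "d"))
)

# Hypothetical topic, e.g. p(word|topic), ranking a > b > c > d
topic <- c(a = 0.4, b = 0.3, c = 0.2, d = 0.1)

# Illustrative helper (an assumption, not part of tidylda)
sketch_coherence <- function(topic, dtm, m = 5) {
  m <- min(m, length(topic))
  # Top m terms, most probable first
  top <- names(sort(topic, decreasing = TRUE))[seq_len(m)]
  # 1 if the term occurs in the document, 0 otherwise
  present <- (dtm[, top, drop = FALSE] > 0) * 1
  p <- colMeans(present)                   # P(word)
  joint <- crossprod(present) / nrow(dtm)  # P(word_i and word_j)
  diffs <- c()
  for (i in seq_len(m - 1)) {
    for (j in (i + 1):m) {
      # P(b|a) - P(b), where a (index i) ranks above b (index j)
      diffs <- c(diffs, joint[i, j] / p[i] - p[j])
    }
  }
  mean(diffs)
}

sketch_coherence(topic, dtm, m = 4)
```

Under these assumptions, a positive value means that lower-ranked words tend to appear more often in documents containing a higher-ranked word than they do overall; values near zero or below suggest the top words are not statistically associated.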


Returns an object of class numeric corresponding to the probabilistic coherence of the input topic(s).


# Load a pre-formatted dtm and topic model

# fit a model
model <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 100, burnin = 50
)

calc_prob_coherence(beta = model$beta, data = nih_sample_dtm, m = 5)

[Package tidylda version 0.0.5 Index]