coherence {text2vec}    R Documentation
Coherence metrics for topic models
Description
Given a topic model with topics represented as ordered term lists, coherence can be used to assess the quality of individual topics.
This function implements several of the many possible metrics for this kind of assessment.
The coherence calculation is sensitive to the content of the reference tcm used for evaluation, which may be created with different parameter settings.
Please refer to the Details section (or the References section) for information on typical combinations of metric and type of tcm.
For more general information on measuring coherence, the References section provides a starting point.
Usage
coherence(x, tcm, metrics = c("mean_logratio", "mean_pmi", "mean_npmi",
"mean_difference", "mean_npmi_cosim", "mean_npmi_cosim2"),
smooth = 1e-12, n_doc_tcm = -1)
Arguments
x
A character matrix with the top terms per topic, e.g., as returned by the get_top_words() method of the topic model (see the example section).

tcm
The term co-occurrence matrix, e.g., a sparse matrix as created by create_tcm() or by crossing a binary document-term matrix with itself (see the example section), serving as the reference for the coherence calculation. Also see n_doc_tcm.

metrics
Character vector specifying the metrics to be calculated. Currently the following metrics are implemented: "mean_logratio", "mean_pmi", "mean_npmi", "mean_difference", "mean_npmi_cosim", "mean_npmi_cosim2" (see the Details section).

smooth
Numeric smoothing constant to avoid taking the logarithm of zero. By default, set to 1e-12.

n_doc_tcm
The number of documents (or virtual documents, e.g., sliding windows) used to create the tcm, needed to derive probabilities from the co-occurrence counts for several of the metrics. See the example section for how this number may be obtained.
Details
The currently implemented coherence metrics are described below, including a description of the type of tcm content that has shown good performance in combination with the respective metric.
For details on how to create a tcm, see the example section.
For details on the performance of the metrics, see the resources in the References section, which served as the basis for the standard settings of the individual metrics.
Note that, depending on the use case, settings other than these standard settings for creating the tcm may still be reasonable.
Note that for all currently implemented metrics the tcm is reduced to the top word space on the basis of the terms in x.
When searching for the optimum number of topics among several models with different metrics, calculating the mean score over all topics per model and normalizing these mean coherence scores across metrics may be considered for a direct comparison; a minimal sketch of such a comparison follows below.
Each metric usually opts for a different optimum number of topics. Initial experience suggests that the logratio, pmi and npmi metrics usually opt for smaller numbers of topics, whereas the other metrics tend to propose higher numbers.
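As an illustration only, assuming hypothetical objects tw_list (a list holding one top word matrix per candidate model), a reference tcm and its document count n_doc:

res_list = lapply(tw_list, function(tw) coherence(tw, tcm, n_doc_tcm = n_doc))
# mean coherence per metric for each candidate model
# (assuming topics in rows and metrics in columns of the coherence result)
mean_scores = t(sapply(res_list, colMeans, na.rm = TRUE))
# rescale each metric to [0, 1] across the candidate models to make the metrics comparable
normalized = apply(mean_scores, 2, function(s) (s - min(s)) / (max(s) - min(s)))
# index of the candidate model that each metric considers optimal
apply(normalized, 2, which.max)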
Implemented metrics:
"mean_logratio"
The logarithmic ratio is calculated as
log(smooth + tcm[x,y]) - log(tcm[y,y]),
where x and y are term index pairs from a "preceding" term index combination.
Given the indices c(1,2,3), the combinations are list(c(2,1), c(3,1), c(3,2)).
The tcm should represent the boolean term co-occurrence (internally the actual counts are used) in the original documents and, therefore, this is an intrinsic metric in the standard use case.
This metric is similar to the UMass metric, however, with a smaller smoothing constant by default and using the mean instead of the sum for aggregation.
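As an illustration only, a minimal sketch of this calculation for a single topic, assuming a symmetric count tcm with term counts on the diagonal (as in the intrinsic tcm of the example section), a hypothetical character vector top_terms holding the topic's top words, and the smoothing constant smooth:

idx = match(top_terms, colnames(tcm))
pairs = t(combn(seq_along(idx), 2))   # rows: c(1,2), c(1,3), c(2,3), ...
scores = apply(pairs, 1, function(p) {
  x = idx[p[2]]                       # "preceding" combination: c(2,1), c(3,1), c(3,2), ...
  y = idx[p[1]]
  log(smooth + tcm[x, y]) - log(tcm[y, y])
})
mean(scores)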
"mean_pmi"
The pointwise mutual information is calculated as
log2((tcm[x,y]/n_doc_tcm) + smooth) - log2(tcm[x,x]/n_doc_tcm) - log2(tcm[y,y]/n_doc_tcm),
where x and y are term index pairs from an arbitrary term index combination that subsets the lower or upper triangle of tcm, e.g. "preceding".
The tcm should represent term co-occurrences within a boolean sliding window of size 10 (internally probabilities are used) in an external reference corpus and, therefore, this is an extrinsic metric in the standard use case.
This metric is similar to the UCI metric, however, with a smaller smoothing constant by default and using the mean instead of the sum for aggregation.
"mean_npmi"
Similar (in terms of all parameter settings, etc.) to the "mean_pmi" metric, but using the normalized pmi instead, which is calculated as
(log2((tcm[x,y]/n_doc_tcm) + smooth) - log2(tcm[x,x]/n_doc_tcm) - log2(tcm[y,y]/n_doc_tcm)) / -log2((tcm[x,y]/n_doc_tcm) + smooth).
This metric may perform better than the simpler pmi metric.
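As an illustration only, a minimal sketch of the pmi and npmi scores for a single term index pair x and y, assuming a hypothetical tcm holding sliding window co-occurrence counts (with the window counts of the individual terms on the diagonal), plus n_doc_tcm and smooth:

p_xy = tcm[x, y] / n_doc_tcm   # joint probability of the pair
p_x  = tcm[x, x] / n_doc_tcm   # marginal probability of term x
p_y  = tcm[y, y] / n_doc_tcm   # marginal probability of term y
pmi  = log2(p_xy + smooth) - log2(p_x) - log2(p_y)
npmi = pmi / -log2(p_xy + smooth)
# the per-topic metric value is the mean of these scores over all term pairs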
"mean_difference"
The difference is calculated as
tcm[x,y]/tcm[x,x] - (tcm[y,y]/n_doc_tcm),
where x and y are term index pairs from a "preceding" term index combination.
Given the indices c(1,2,3), the combinations are list(c(1,2), c(1,3), c(2,3)).
The tcm should represent the boolean term co-occurrence (internally probabilities are used) in the original documents and, therefore, this is an intrinsic metric in the standard use case.
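As an illustration only, the score for a single term index pair x (earlier index) and y (later index), assuming the same kind of intrinsic count tcm and the n_doc_tcm value:

score_xy = tcm[x, y] / tcm[x, x] - tcm[y, y] / n_doc_tcm
# the per-topic metric value is the mean of these scores over all term pairs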
"mean_npmi_cosim"
First, the npmi of an individual top word with each of the top words is calculated as in "mean_npmi".
This results in a vector of npmi values for each top word.
On this basis, the cosine similarity between each pair of vectors is calculated.
The tcm should represent term co-occurrences within a boolean sliding window of size 5 (internally probabilities are used) in an external reference corpus and, therefore, this is an extrinsic metric in the standard use case.
"mean_npmi_cosim2"
First, a vector of npmi values for each top word is calculated as in "mean_npmi_cosim".
On this basis, the cosine similarity between each vector and the sum of all vectors is calculated (instead of the similarity between each pair).
The tcm should represent term co-occurrences within a boolean sliding window of size 110 (internally probabilities are used) in an external reference corpus and, therefore, this is an extrinsic metric in the standard use case.
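As an illustration only, a minimal sketch of the two cosine similarity aggregations, assuming a hypothetical matrix npmi_mat whose rows hold, for each top word of a topic, the npmi values against all top words of that topic (computed as for "mean_npmi"):

cosine = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# "mean_npmi_cosim": mean cosine similarity over all pairs of npmi vectors
pairs = t(combn(nrow(npmi_mat), 2))
mean(apply(pairs, 1, function(p) cosine(npmi_mat[p[1], ], npmi_mat[p[2], ])))
# "mean_npmi_cosim2": mean cosine similarity of each vector with the sum of all vectors
total = colSums(npmi_mat)
mean(apply(npmi_mat, 1, function(v) cosine(v, total)))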
Value
A numeric matrix with the coherence scores of the specified metrics per topic.
References
The paper mentioned below is the main theoretical basis for this code.
Currently, only a selection of the metrics described in this paper is included in this R implementation.
Roeder, Michael; Both, Andreas; Hinneburg, Alexander (2015):
Exploring the Space of Topic Coherence Measures.
In: Xueqi Cheng, Hang Li, Evgeniy Gabrilovich and Jie Tang (Eds.):
Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15), Shanghai, China, February 2-6, 2015.
New York, USA: ACM Press, pp. 399-408.
https://dl.acm.org/citation.cfm?id=2685324
The measures from this paper were implemented by the above authors in the Java program "Palmetto".
See https://github.com/dice-group/Palmetto or http://aksw.org/Projects/Palmetto.html.
Examples
## Not run:
library(data.table)
library(text2vec)
library(Matrix)
data("movie_review")
N = 500
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it = itoken(tokens, progressbar = FALSE)
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2)
dtm = create_dtm(it, vocab_vectorizer(v))
n_topics = 10
lda_model = LDA$new(n_topics)
fitted = lda_model$fit_transform(dtm, n_iter = 20)
tw = lda_model$get_top_words(n = 10, lambda = 1)
# for demonstration purposes create intrinsic TCM from original documents
# scores might not make sense for metrics that are designed for extrinsic TCM
tcm = crossprod(sign(dtm))
# check coherence
logger = lgr::get_logger('text2vec')
logger$set_threshold('debug')
res = coherence(tw, tcm, n_doc_tcm = N)
res
# example how to create TCM for extrinsic measures from an external corpus
external_reference_corpus = tolower(movie_review$review[501:1000])
tokens_ext = word_tokenizer(external_reference_corpus)
iterator_ext = itoken(tokens_ext, progressbar = FALSE)
v_ext = create_vocabulary(iterator_ext)
# for reasons of efficiency vocabulary may be reduced to the terms matched in the original corpus
v_ext = v_ext[v_ext$term %in% v$term, ]
# external vocabulary may be pruned depending on the use case
v_ext = prune_vocabulary(v_ext, term_count_min = 5, doc_proportion_max = 0.2)
vectorizer_ext = vocab_vectorizer(v_ext)
# for demonstration purposes a boolean co-occurrence within sliding window of size 10 is used
# 10 represents sentence co-occurrence, a size of 110 would, e.g., be paragraph co-occurrence
window_size = 10
tcm_ext = create_tcm(iterator_ext, vectorizer_ext
,skip_grams_window = window_size
,weights = rep(1, window_size)
,binary_cooccurence = TRUE
)
# add marginal term counts to the diagonal (by default only the upper triangle of the tcm is created)
diag(tcm_ext) = attributes(tcm_ext)$word_count
# get number of sliding windows that serve as virtual documents, i.e. n_doc_tcm argument
n_skip_gram_windows = sum(sapply(tokens_ext, function(x) {length(x)}))
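# a possible way to pass the extrinsic tcm created above to coherence();
# the metrics designed for an extrinsic tcm are of main interest here
res_ext = coherence(tw, tcm_ext, metrics = c("mean_pmi", "mean_npmi"),
                    n_doc_tcm = n_skip_gram_windows)
res_ext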
## End(Not run)