jsTopics {ldaPrototype}		R Documentation

Pairwise Jensen-Shannon Similarities (Divergences)

Description

Calculates the similarity of all pairwise topic combinations using the Jensen-Shannon Divergence.

Usage

jsTopics(topics, epsilon = 1e-06, progress = TRUE, pm.backend, ncpus)

Arguments

topics

[named matrix]
The counts of words (row-wise) in topics (column-wise).

epsilon

[numeric(1)]
Numerical value added to topics to ensure computability. See details. Default is 1e-06.

progress

[logical(1)]
Should a nice progress bar be shown? Turning it off could lead to significantly faster calculation. Default is TRUE. If pm.backend is set, the computation is parallelized and no progress bar is shown.

pm.backend

[character(1)]
One of "multicore", "socket" or "mpi". If pm.backend is set, parallelStart is called before computation is started and parallelStop is called after.

ncpus

[integer(1)]
Number of (physical) CPUs to use. If pm.backend is passed, default is determined by availableCores.
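
For illustration, a minimal sketch of a parallelized call; the socket backend with two workers is an arbitrary choice, and topics is assumed to be a count matrix as in the Examples section below:

# Sketch only: distribute the pairwise computations over a socket cluster
# with two workers; parallelStart/parallelStop are handled internally.
js_par = jsTopics(topics, pm.backend = "socket", ncpus = 2)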

Details

The Jensen-Shannon Similarity of two topics \bm{z}_i and \bm{z}_j is calculated by

JS(\bm{z}_i, \bm{z}_j) = 1 - \left( KLD\left(\bm{p}_i, \frac{\bm{p}_i + \bm{p}_j}{2}\right) + KLD\left(\bm{p}_j, \frac{\bm{p}_i + \bm{p}_j}{2}\right) \right)/2

= 1 - KLD(\bm{p}_i, \bm{p}_i + \bm{p}_j)/2 - KLD(\bm{p}_j, \bm{p}_i + \bm{p}_j)/2 - \log(2),

where V is the vocabulary size, \bm{p}_k = \left(p_k^{(1)}, ..., p_k^{(V)}\right), and p_k^{(v)} is the proportion of assignments of the v-th word to the k-th topic. KLD denotes the Kullback-Leibler Divergence, calculated by

KLD(\bm{p}_k, \bm{p}_{\Sigma}) = \sum_{v=1}^{V} p_k^{(v)} \log{\frac{p_k^{(v)}}{p_{\Sigma}^{(v)}}}.

The epsilon is added to every count n_k^{(v)} of assignments of the v-th word to the k-th topic (not to the proportions) to ensure computability in the presence of zero counts.
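
The following minimal sketch illustrates the formula above for two toy count vectors; the object names and toy data are purely illustrative and this is not the package's internal implementation:

# Toy counts of word assignments to two topics over a vocabulary of size 5
n_i = c(10, 0, 3, 7, 1)
n_j = c(2, 5, 4, 0, 9)
eps = 1e-6

p_i = (n_i + eps) / sum(n_i + eps)   # proportions after epsilon smoothing
p_j = (n_j + eps) / sum(n_j + eps)
m   = (p_i + p_j) / 2                # mixture of the two topic distributions

kld = function(p, q) sum(p * log(p / q))   # Kullback-Leibler Divergence

1 - (kld(p_i, m) + kld(p_j, m)) / 2        # Jensen-Shannon Similarity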

Value

[named list] with entries

sims

[lower triangular named matrix] with all pairwise similarities of the given topics.

wordslimit

[integer] = vocabulary size. See jaccardTopics for the original purpose.

wordsconsidered

[integer] = vocabulary size. See jaccardTopics for the original purpose.

param

[named list] with parameter specifications for type [character(1)] = "Jensen-Shannon Divergence" and epsilon [numeric(1)]. See above for explanation.
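
A short sketch of inspecting the returned object (entry names as documented above; js as computed in the Examples section below):

names(js)          # "sims", "wordslimit", "wordsconsidered", "param"
str(js$param)      # type and epsilon used for the computation
js$sims[1:3, 1:3]  # part of the lower triangular similarity matrix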

See Also

Other TopicSimilarity functions: cosineTopics(), dendTopics(), getSimilarity(), jaccardTopics(), rboTopics()

Examples

# Fit four LDA runs on the example corpus and merge their topic matrices
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)

# Pairwise Jensen-Shannon similarities with the default epsilon
js = jsTopics(topics)
js

sim = getSimilarity(js)
dim(sim)

# Effect of a larger epsilon on the similarities
js1 = jsTopics(topics, epsilon = 1)
sim1 = getSimilarity(js1)
summary((sim1-sim)[lower.tri(sim)])
plot(sim, sim1, xlab = "epsilon = 1e-6", ylab = "epsilon = 1")


[Package ldaPrototype version 0.3.1 Index]