jsTopics {ldaPrototype}    R Documentation

Pairwise Jensen-Shannon Similarities (Divergences)

Description

Calculates the similarity of all pairwise topic combinations using the Jensen-Shannon Divergence.

Usage

jsTopics(topics, epsilon = 1e-06, progress = TRUE, pm.backend, ncpus)

Arguments

topics

[named matrix]
The counts of vocabulary words (row-wise) in topics (column-wise).

epsilon

[numeric(1)]
Numerical value added to topics to ensure computability. See details. Default is 1e-06.

progress

[logical(1)]
Should a progress bar be shown? Turning it off can make the calculation significantly faster. Default is TRUE. If pm.backend is set, the computation is parallelized and no progress bar is shown.

pm.backend

[character(1)]
One of "multicore", "socket" or "mpi". If pm.backend is set, parallelStart is called before computation is started and parallelStop is called after.

ncpus

[integer(1)]
Number of (physical) CPUs to use. If pm.backend is passed, default is determined by availableCores.

Details

The Jensen-Shannon Similarity for two topics $\bm z_{i}$ and $\bm z_{j}$ is calculated by

$$JS(\bm z_{i}, \bm z_{j}) = 1 - \left( KLD\left(\bm p_i, \frac{\bm p_i + \bm p_j}{2}\right) + KLD\left(\bm p_j, \frac{\bm p_i + \bm p_j}{2}\right) \right)/2$$

$$= 1 - KLD(\bm p_i, \bm p_i + \bm p_j)/2 - KLD(\bm p_j, \bm p_i + \bm p_j)/2 - \log(2)$$

where $V$ is the vocabulary size, $\bm p_k = \left(p_k^{(1)}, ..., p_k^{(V)}\right)$, and $p_k^{(v)}$ is the proportion of assignments of the $v$-th word to the $k$-th topic. KLD denotes the Kullback-Leibler Divergence, calculated by

$$KLD(\bm p_{k}, \bm p_{\Sigma}) = \sum_{v=1}^{V} p_k^{(v)} \log{\frac{p_k^{(v)}}{p_{\Sigma}^{(v)}}}.$$

To ensure computability with respect to zeros, epsilon is added to every $n_k^{(v)}$, the count (not proportion) of assignments.
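The two forms of the formula agree because each proportion vector sums to one, so dividing the second argument of KLD by 2 only adds a constant: KLD(p, (p_i + p_j)/2) = KLD(p, p_i + p_j) + log(2). The base R sketch below illustrates this on a made-up count matrix. It is not the package's implementation; the helper names `kld`, `js_sim` and `js_sim2` and the toy counts are assumptions chosen for illustration, and the natural logarithm is assumed.

```r
# Toy count matrix (made-up values): 4 words (rows) x 3 topics (columns).
topics <- matrix(
  c(5, 0, 3, 2,
    5, 0, 3, 2,
    0, 7, 0, 1),
  nrow = 4,
  dimnames = list(paste0("w", 1:4), paste0("t", 1:3))
)

# Kullback-Leibler sum as in the formula; q does not need to sum to one.
kld <- function(p, q) sum(p * log(p / q))

# Hypothetical helper: first form of the formula (mixture (p_i + p_j) / 2).
js_sim <- function(n_i, n_j, epsilon = 1e-06) {
  p_i <- (n_i + epsilon) / sum(n_i + epsilon)  # epsilon is added to the counts
  p_j <- (n_j + epsilon) / sum(n_j + epsilon)
  m <- (p_i + p_j) / 2
  1 - (kld(p_i, m) + kld(p_j, m)) / 2
}

# Hypothetical helper: second form (unnormalized sum p_i + p_j, minus log(2)).
js_sim2 <- function(n_i, n_j, epsilon = 1e-06) {
  p_i <- (n_i + epsilon) / sum(n_i + epsilon)
  p_j <- (n_j + epsilon) / sum(n_j + epsilon)
  1 - kld(p_i, p_i + p_j) / 2 - kld(p_j, p_i + p_j) / 2 - log(2)
}

s12 <- js_sim(topics[, 1], topics[, 2])      # identical topics
s13 <- js_sim(topics[, 1], topics[, 3])      # mostly different topics
s13_alt <- js_sim2(topics[, 1], topics[, 3]) # same pair, second form

# Without epsilon, zero counts produce 0 * log(0), which is NaN in R.
s13_no_eps <- js_sim(topics[, 1], topics[, 3], epsilon = 0)
```

In this sketch, identical topics have similarity 1, and with the natural logarithm the similarity of any pair stays above 1 - log(2), because the Jensen-Shannon divergence is bounded by log(2).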

Value

[named list] with entries

sims

[lower triangular named matrix] with all pairwise similarities of the given topics.

wordslimit

[integer] = vocabulary size. See jaccardTopics for original purpose.

wordsconsidered

[integer] = vocabulary size. See jaccardTopics for original purpose.

param

[named list] with parameter specifications for type [character(1)] = "Jensen-Shannon Divergence" and epsilon [numeric(1)]. See above for explanation.
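Because sims is lower triangular, some downstream functions may need a full symmetric matrix. The base R sketch below shows one way to mirror such a matrix and to find the most similar pair. The matrix here is a hypothetical stand-in with made-up values, and it assumes that entries outside the lower triangle are NA; in practice the matrix would come from getSimilarity, as in the Examples.

```r
# Hypothetical stand-in for a lower triangular similarity matrix.
# Assumption: entries outside the lower triangle are NA. Values are made up.
sims <- matrix(NA_real_, nrow = 3, ncol = 3,
               dimnames = list(paste0("t", 1:3), paste0("t", 1:3)))
sims[lower.tri(sims)] <- c(0.9, 0.4, 0.5)  # (t2,t1), (t3,t1), (t3,t2)

# Mirror the lower triangle into the upper triangle.
full <- sims
full[upper.tri(full)] <- t(full)[upper.tri(full)]
diag(full) <- 1  # a topic compared with itself has similarity 1

# Locate the most similar pair of distinct topics.
idx <- which(sims == max(sims, na.rm = TRUE), arr.ind = TRUE)
best_pair <- c(rownames(sims)[idx[1, "row"]], colnames(sims)[idx[1, "col"]])
```

Keeping only the lower triangle avoids storing each symmetric value twice; the mirrored copy is useful only where a function expects a complete matrix.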

See Also

Other TopicSimilarity functions: cosineTopics(), dendTopics(), getSimilarity(), jaccardTopics(), rboTopics()

Examples

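# fit 4 replications of an LDA with 10 topics on the Reuters example data
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
# merge the topics of all replications into one word-by-topic count matrix
topics = mergeTopics(res, vocab = reuters_vocab)
js = jsTopics(topics)
js

# extract the matrix of pairwise similarities
sim = getSimilarity(js)
dim(sim)

# compare the default epsilon with a much larger one
js1 = jsTopics(topics, epsilon = 1)
sim1 = getSimilarity(js1)
summary((sim1-sim)[lower.tri(sim)])
plot(sim, sim1, xlab = "epsilon = 1e-6", ylab = "epsilon = 1")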


[Package ldaPrototype version 0.3.1 Index]