R: Pairwise Jensen-Shannon Similarities (Divergences)

jsTopics {ldaPrototype}

R Documentation

Pairwise Jensen-Shannon Similarities (Divergences)

Description

Calculates the similarity of all pairwise topic combinations using the Jensen-Shannon Divergence.

Usage

jsTopics(topics, epsilon = 1e-06, progress = TRUE, pm.backend, ncpus)

Arguments

`topics`	[`named matrix`] The counts of vocabulary words (row-wise) per topic (column-wise).
`epsilon`	[`numeric(1)`] Numerical value added to `topics` to ensure computability. See details. Default is `1e-06`.
`progress`	[`logical(1)`] Should a progress bar be shown? Turning it off can make the calculation significantly faster. Default is `TRUE`. If `pm.backend` is set, the computation is parallelized and no progress bar is shown.
`pm.backend`	[`character(1)`] One of "multicore", "socket" or "mpi". If `pm.backend` is set, `parallelStart` is called before computation is started and `parallelStop` is called after.
`ncpus`	[`integer(1)`] Number of (physical) CPUs to use. If `pm.backend` is passed, default is determined by `availableCores`.

Details

The Jensen-Shannon Similarity for two topics \bm z_{i} and \bm z_{j} is calculated by

JS(\bm z_{i}, \bm z_{j}) = 1 - \left( KLD\left(\bm p_i, \frac{\bm p_i + \bm p_j}{2}\right) + KLD\left(\bm p_j, \frac{\bm p_i + \bm p_j}{2}\right) \right)/2

= 1 - KLD(\bm p_i, \bm p_i + \bm p_j)/2 - KLD(\bm p_j, \bm p_i + \bm p_j)/2 - \log(2)

where V is the vocabulary size, \bm p_k = \left(p_k^{(1)}, ..., p_k^{(V)}\right), and p_k^{(v)} is the proportion of assignments of the v-th word to the k-th topic. KLD denotes the Kullback-Leibler Divergence of \bm p_k from a reference vector \bm p_{\Sigma}, calculated by

KLD(\bm p_{k}, \bm p_{\Sigma}) = \sum_{v=1}^{V} p_k^{(v)} \log{\frac{p_k^{(v)}}{p_{\Sigma}^{(v)}}}.

To ensure computability in the presence of zeros, epsilon is added to every n_k^{(v)}, the count (not the proportion) of assignments of the v-th word to the k-th topic.
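
As an illustration of the formula above, the following base R sketch computes the Jensen-Shannon similarity for a single pair of topics. It is not the package's implementation: the helper `js_similarity`, the inner `kld` function and the toy matrix `toy` are hypothetical names introduced here, and the sketch assumes that epsilon is added to the counts before they are normalized to proportions.

```r
# Hypothetical helper (not part of ldaPrototype): Jensen-Shannon similarity
# of two topics given as count vectors over the same vocabulary.
js_similarity <- function(n_i, n_j, epsilon = 1e-06) {
  # Assumption: epsilon is added to the counts, then counts become proportions.
  p_i <- (n_i + epsilon) / sum(n_i + epsilon)
  p_j <- (n_j + epsilon) / sum(n_j + epsilon)
  # Kullback-Leibler divergence of p from q (natural logarithm).
  kld <- function(p, q) sum(p * log(p / q))
  m <- (p_i + p_j) / 2
  first_form <- 1 - (kld(p_i, m) + kld(p_j, m)) / 2
  # Second form from Details: unnormalized sum as reference, minus log(2).
  second_form <- 1 - kld(p_i, p_i + p_j) / 2 - kld(p_j, p_i + p_j) / 2 - log(2)
  stopifnot(isTRUE(all.equal(first_form, second_form)))
  first_form
}

# Toy count matrix: 4 words (rows) x 3 topics (columns); t1 and t3 are identical.
toy <- matrix(c(10, 0, 5, 1,
                 0, 8, 2, 6,
                10, 0, 5, 1),
              nrow = 4,
              dimnames = list(paste0("w", 1:4), paste0("t", 1:3)))

sim_same <- js_similarity(toy[, "t1"], toy[, "t3"])  # identical topics
sim_diff <- js_similarity(toy[, "t1"], toy[, "t2"])  # different topics
```

With the natural logarithm, the Jensen-Shannon divergence is at most log(2), so the similarity of two topics lies between 1 - log(2) and 1; identical topics, such as `t1` and `t3` here, reach the upper end.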

Value

[named list] with entries

sims: [lower triangular named matrix] with all pairwise similarities of the given topics.
wordslimit: [integer] = vocabulary size. See jaccardTopics for original purpose.
wordsconsidered: [integer] = vocabulary size. See jaccardTopics for original purpose.
param: [named list] with parameter specifications for type [character(1)] = "Jensen-Shannon Divergence" and epsilon [numeric(1)]. See above for explanation.

Examples

res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
js = jsTopics(topics)
js

sim = getSimilarity(js)
dim(sim)

js1 = jsTopics(topics, epsilon = 1)
sim1 = getSimilarity(js1)
summary((sim1-sim)[lower.tri(sim)])
plot(sim, sim1, xlab = "epsilon = 1e-6", ylab = "epsilon = 1")

[Package ldaPrototype version 0.3.1 Index]