R: Pairwise Jaccard Coefficients

jaccardTopics {ldaPrototype}

R Documentation

Pairwise Jaccard Coefficients

Description

Calculates the similarity of all pairwise topic combinations using a modified Jaccard Coefficient.

Usage

jaccardTopics(
  topics,
  limit.rel,
  limit.abs,
  atLeast,
  progress = TRUE,
  pm.backend,
  ncpus
)

Arguments

`topics`	[`named matrix`] Word counts per topic, with vocabulary words in rows and topics in columns.
`limit.rel`	[0,1] A relative lower bound for deciding which words are taken into account. A word is considered relevant for a topic if its count is higher than `limit.rel` multiplied by the total count of that topic. Default is `1/500`.
`limit.abs`	[`integer(1)`] An absolute lower bound for deciding which words are taken into account. A word is considered relevant for a topic if its count is higher than `limit.abs`. Default is `10`.
`atLeast`	[`integer(1)`] The minimum number of words that are considered relevant for each topic. Default is `0`.
`progress`	[`logical(1)`] Should a progress bar be shown? Turning it off can lead to significantly faster calculation. Default is `TRUE`. If `pm.backend` is set, the computation is parallelized and no progress bar is shown.
`pm.backend`	[`character(1)`] One of "multicore", "socket" or "mpi". If `pm.backend` is set, `parallelStart` is called before computation is started and `parallelStop` is called after.
`ncpus`	[`integer(1)`] Number of (physical) CPUs to use. If `pm.backend` is passed, default is determined by `availableCores`.

Details

The modified Jaccard Coefficient for two topics \bm z_{i} and \bm z_{j} is calculated by

J_m(\bm z_{i}, \bm z_{j} \mid \bm c) = \frac{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\wedge~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\vee~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}

where V is the vocabulary size and n_k^{(v)} is the count of assignments of the v-th word to the k-th topic. For each topic, the entry of the threshold vector \bm c is the larger of the two user-given lower bounds: the absolute bound limit.abs and the relative bound limit.rel multiplied by the total count of the topic. In addition, at least atLeast words per topic are considered for the calculation. Accordingly, if fewer than atLeast words are considered relevant after applying limit.rel and limit.abs, the atLeast most common words of that topic are used to determine topic similarities.
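
The formula can be traced by hand on a small example. The following base R sketch is not the package's implementation; the count matrix is made up, and the threshold rule (the larger of limit.abs and limit.rel times the topic's total count) is an assumption read from the argument descriptions above. The atLeast fallback is left out here.

```r
# Toy count matrix: 6 words (rows) x 2 topics (columns); values are made up.
topics <- matrix(
  c(40, 25, 12,  3,  0, 20,
     0, 30, 15, 50,  2,  3),
  nrow = 6,
  dimnames = list(paste0("w", 1:6), c("topic1", "topic2"))
)

limit.rel <- 1 / 500  # documented default
limit.abs <- 10       # documented default

# Assumed threshold per topic: the larger of the absolute bound and the
# relative bound scaled by the topic's total count.
thresholds <- pmax(limit.abs, limit.rel * colSums(topics))

# relevant[v, k] is TRUE if the count of word v exceeds the threshold of topic k.
relevant <- sweep(topics, 2, thresholds, FUN = ">")

# Modified Jaccard: words relevant in both topics / words relevant in either.
both <- sum(relevant[, 1] & relevant[, 2])
either <- sum(relevant[, 1] | relevant[, 2])
jaccard <- both / either
jaccard
```

Here w1, w2, w3 and w6 are relevant for the first topic and w2, w3 and w4 for the second, so the coefficient is 2 / 5.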

Relevant words are determined for each topic individually. The values wordslimit and wordsconsidered give the number of relevant words per topic before and after applying atLeast, respectively.
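
The difference between the two counts can be illustrated with a short base R sketch. It is an illustration under assumptions, not the package's code: the counts are made up, the threshold rule is read from the argument descriptions, and ties at the atLeast cutoff are all kept, which may differ from the package's behaviour.

```r
# Toy count matrix: 5 words (rows) x 2 topics (columns); values are made up.
topics <- matrix(
  c(30, 14,  9,  5,  2,
     8,  7,  4,  1,  0),
  nrow = 5,
  dimnames = list(paste0("w", 1:5), c("topic1", "topic2"))
)
limit.rel <- 1 / 500
limit.abs <- 10
atLeast <- 2

# Assumed per-topic threshold and resulting relevance mask.
thresholds <- pmax(limit.abs, limit.rel * colSums(topics))
relevant <- sweep(topics, 2, thresholds, FUN = ">")

# Number of words passing the thresholds, per topic.
wordslimit <- colSums(relevant)

# Fallback: a topic with fewer than atLeast relevant words uses its
# atLeast most frequent words instead (ties at the cutoff are all kept).
for (k in which(wordslimit < atLeast)) {
  cutoff <- sort(topics[, k], decreasing = TRUE)[atLeast]
  relevant[, k] <- topics[, k] >= cutoff
}

# Number of words actually used for the similarity calculation, per topic.
wordsconsidered <- colSums(relevant)
rbind(wordslimit, wordsconsidered)
```

In this toy case no word of the second topic exceeds the threshold of 10, so its two most frequent words are used instead; the first topic is unaffected.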

Value

[named list] with entries

sims: [lower triangular named matrix] with all pairwise Jaccard similarities of the given topics.
wordslimit: [integer] with counts of words determined as relevant based on limit.rel and limit.abs.
wordsconsidered: [integer] with counts of words considered for the similarity calculation. Can differ from wordslimit if atLeast is greater than zero.
param: [named list] with parameter specifications for type [character(1)] = "Jaccard Coefficient", limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
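
Because sims is lower triangular, some downstream steps need the pairwise values as a vector or as a full symmetric matrix. The base R sketch below uses a hypothetical 3 x 3 matrix with made-up values; that the entries on and above the diagonal are NA is an assumption about the returned object, not something stated here.

```r
# Hypothetical lower triangular similarity matrix; values are made up.
# Assumption: entries on and above the diagonal are NA.
ids <- c("t1", "t2", "t3")
sims <- matrix(NA_real_, nrow = 3, ncol = 3, dimnames = list(ids, ids))
sims[lower.tri(sims)] <- c(0.5, 0.2, 0.4)  # (t2,t1), (t3,t1), (t3,t2)

# Vector of the unique pairwise similarities.
pairwise <- sims[lower.tri(sims)]

# Full symmetric matrix; a topic is taken to have similarity 1 with itself.
full <- sims
full[upper.tri(full)] <- t(sims)[upper.tri(sims)]
diag(full) <- 1

# Distance object, e.g. as input for stats::hclust().
d <- as.dist(1 - full)
d
```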

Examples

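# Fit 4 LDA replications with 10 topics each and merge their topic-word counts
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
# Pairwise modified Jaccard similarities, using at least 2 words per topic
jacc = jaccardTopics(topics, atLeast = 2)
jacc

# Topics for which atLeast changed the number of words used
n1 = getConsideredWords(jacc)
n2 = getRelevantWords(jacc)
(n1 - n2)[n1 - n2 != 0]

# Extract the similarity matrix
sim = getSimilarity(jacc)
dim(sim)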

# Comparison to Cosine and Jensen-Shannon (more interesting on large datasets)
cosine = cosineTopics(topics)
js = jsTopics(topics)

sims = list(jaccard = sim, cosine = getSimilarity(cosine), js = getSimilarity(js))
pairs(do.call(cbind, lapply(sims, as.vector)))

[Package ldaPrototype version 0.3.1 Index]