jaccardTopics {ldaPrototype} | R Documentation |
Pairwise Jaccard Coefficients
Description
Calculates the similarity of all pairwise topic combinations using a modified Jaccard Coefficient.
Usage
jaccardTopics(
  topics,
  limit.rel,
  limit.abs,
  atLeast,
  progress = TRUE,
  pm.backend,
  ncpus
)
Arguments
topics |
[ |
limit.rel |
[0,1] |
limit.abs |
[ |
atLeast |
[ |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
Details
The modified Jaccard Coefficient for two topics \bm z_{i} and \bm z_{j} is calculated by

J_m(\bm z_{i}, \bm z_{j} \mid \bm c) = \frac{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\wedge~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\vee~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}

where V is the vocabulary size and n_k^{(v)} is the count of assignments of the v-th word to the k-th topic. Each entry of the threshold vector \bm c is the larger of the lower bounds implied by the user-given limit.rel and limit.abs. In addition, at least atLeast words per topic are considered for calculation. Accordingly, if fewer than atLeast words are considered relevant after applying limit.rel and limit.abs, the atLeast most common words of that topic are taken to determine topic similarities.
The procedure of determining relevant words is executed for each topic individually.
The values wordslimit and wordsconsidered describe the number of relevant words per topic.
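The calculation described above can be sketched in base R for a single pair of topics. This is only an illustration on made-up counts, not the package's implementation: relevant_words() and jaccard_pair() are hypothetical helper names, and the sketch assumes that limit.rel is applied to the total count of the respective topic, which the text above does not spell out.

```r
# Toy topic-word counts (made-up numbers): one vector of word counts per topic.
counts <- cbind(
  t1 = c(50, 30, 12, 5, 2, 1),
  t2 = c(40, 3, 20, 15, 0, 22)
)
rownames(counts) <- paste0("w", 1:6)

# Hypothetical helper: which words of one topic count as relevant?
relevant_words <- function(n, limit.rel, limit.abs, atLeast) {
  # Assumption: limit.rel is relative to the topic's total count.
  threshold <- max(limit.rel * sum(n), limit.abs)
  keep <- n > threshold
  if (sum(keep) < atLeast) {
    # Too few words pass the threshold: take the atLeast most common words.
    keep <- rank(-n, ties.method = "first") <= atLeast
  }
  keep
}

# Hypothetical helper: modified Jaccard coefficient of two topics.
jaccard_pair <- function(ni, nj, ...) {
  ri <- relevant_words(ni, ...)
  rj <- relevant_words(nj, ...)
  sum(ri & rj) / sum(ri | rj)  # shared relevant words / all relevant words
}

# t1 keeps w1, w2, w3; t2 keeps w1, w3, w4, w6: 2 shared / 5 in union = 0.4
jaccard_pair(counts[, "t1"], counts[, "t2"],
             limit.rel = 0.1, limit.abs = 10, atLeast = 2)
```

Because the relevant words are determined per topic, the two topics in a pair can end up with different thresholds whenever their total counts differ.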
Value
[named list] with entries

sims
[lower triangular named matrix] with all pairwise Jaccard similarities of the given topics.

wordslimit
[integer] with counts of words determined as relevant based on limit.rel and limit.abs.

wordsconsidered
[integer] with counts of considered words for similarity calculation. Could differ from wordslimit, if atLeast is greater than zero.

param
[named list] with parameter specifications for type [character(1)] = "Jaccard Coefficient", limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
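Since sims is stored as a lower triangular matrix, some downstream steps may need a full symmetric matrix. A minimal base R sketch on a toy stand-in follows; it assumes that the diagonal and the upper triangle hold NA, which should be checked against the actual object.

```r
# Toy stand-in for the sims entry (made-up values, hypothetical topic names).
# Assumption: diagonal and upper triangle are NA.
topic_names <- paste0("t", 1:3)
sims <- matrix(NA_real_, 3, 3, dimnames = list(topic_names, topic_names))
sims[lower.tri(sims)] <- c(0.4, 0.1, 0.25)

# Mirror the lower triangle into the upper triangle.
full <- sims
full[upper.tri(full)] <- t(full)[upper.tri(full)]

# A topic compared with itself has coefficient 1
# (provided it has at least one relevant word).
diag(full) <- 1

full
```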
See Also
Other TopicSimilarity functions: cosineTopics(), dendTopics(), getSimilarity(), jsTopics(), rboTopics()
Other workflow functions: LDARep(), SCLOP(), dendTopics(), getPrototype(), mergeTopics()
Examples
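library(ldaPrototype)

# Fit 4 replications of an LDA with 10 topics each on the Reuters example data
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
# Merge the topics of all replications into one object
topics = mergeTopics(res, vocab = reuters_vocab)
# Pairwise modified Jaccard coefficients, using at least 2 words per topic
jacc = jaccardTopics(topics, atLeast = 2)
jacc

# Topics for which atLeast added words beyond those passing the limits
n1 = getConsideredWords(jacc)
n2 = getRelevantWords(jacc)
(n1 - n2)[n1 - n2 != 0]

# Extract the similarity matrix
sim = getSimilarity(jacc)
dim(sim)

# Comparison to Cosine and Jensen-Shannon (more interesting on large datasets)
cosine = cosineTopics(topics)
js = jsTopics(topics)
sims = list(jaccard = sim, cosine = getSimilarity(cosine), js = getSimilarity(js))
pairs(do.call(cbind, lapply(sims, as.vector)))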