R: Pairwise Cosine Similarities

cosineTopics {ldaPrototype}

R Documentation

Pairwise Cosine Similarities

Description

Calculates the similarity of all pairwise topic combinations using the Cosine Similarity.

Usage

cosineTopics(topics, progress = TRUE, pm.backend, ncpus)

Arguments

`topics`	[`named matrix`] Word counts per topic, with the vocabulary words as rows and the topics as columns.
`progress`	[`logical(1)`] Should a progress bar be shown? Turning it off could lead to significantly faster calculation. Default is `TRUE`. If `pm.backend` is set, the computation is parallelized and no progress bar is shown.
`pm.backend`	[`character(1)`] One of "multicore", "socket" or "mpi". If `pm.backend` is set, `parallelStart` is called before computation is started and `parallelStop` is called after.
`ncpus`	[`integer(1)`] Number of (physical) CPUs to use. If `pm.backend` is passed, default is determined by `availableCores`.

Details

The Cosine Similarity for two topics \bm z_{i} and \bm z_{j} is calculated by

\cos(\theta | \bm z_{i}, \bm z_{j}) = \frac{ \sum_{v=1}^{V}{n_{i}^{(v)} n_{j}^{(v)}} }{ \sqrt{\sum_{v=1}^{V}{\left(n_{i}^{(v)}\right)^2}} \sqrt{\sum_{v=1}^{V}{\left(n_{j}^{(v)}\right)^2}} }

where \theta is the angle between the corresponding count vectors \bm z_{i} and \bm z_{j}, V is the vocabulary size, and n_k^{(v)} is the count of assignments of the v-th word to the k-th topic.
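
The formula above can be reproduced in a few lines of base R. The following is only a minimal sketch on a made-up count matrix, not the package's internal implementation; the object names (`topics`, `dots`, `norms`, `sims_full`, `sims_lower`) and the toy counts are assumptions chosen for illustration.

```r
# Hypothetical toy data: 4 words (rows) counted in 3 topics (columns).
topics <- matrix(
  c(3, 0, 1,
    0, 2, 0,
    1, 1, 4,
    0, 5, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("word", 1:4), paste0("topic", 1:3))
)

# Numerator: dot products of all pairs of topic columns.
dots <- crossprod(topics)

# Denominator: product of the Euclidean norms of the two count vectors.
norms <- sqrt(colSums(topics^2))

# Full symmetric matrix of cosine similarities.
sims_full <- dots / outer(norms, norms)

# The distinct pairs (i > j) sit in the strictly lower triangle.
sims_lower <- sims_full[lower.tri(sims_full)]

sims_full
```

Because all counts are non-negative, every similarity lies between 0 and 1, and each topic has a similarity of 1 with itself.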

Value

[named list] with entries

sims: [lower triangular named matrix] with all pairwise similarities of the given topics.
wordslimit: [integer] = vocabulary size. See jaccardTopics for original purpose.
wordsconsidered: [integer] = vocabulary size. See jaccardTopics for original purpose.
param: [named list] with parameter type [character(1)] = "Cosine Similarity".

Examples

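# Fit 4 replications of an LDA with 10 topics each on the Reuters example data
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
# Merge the topics of all replications into one word-by-topic count matrix
topics = mergeTopics(res, vocab = reuters_vocab)
# Compute all pairwise cosine similarities between the merged topics
cosine = cosineTopics(topics)
cosine

# Extract the similarity matrix from the result object
sim = getSimilarity(cosine)
dim(sim)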

[Package ldaPrototype version 0.3.1 Index]