top.topic.words {lda}    R Documentation
Get the Top Words and Documents in Each Topic
Description
These functions take a model fitted using lda.collapsed.gibbs.sampler and return a matrix of the top words in each topic (top.topic.words) or of the top documents in each topic (top.topic.documents).
Usage
top.topic.words(topics, num.words = 20, by.score = FALSE)
top.topic.documents(document_sums, num.documents = 20, alpha = 0.1)
Arguments
topics
For top.topic.words, a K x V matrix of word counts (or weights) with one row per topic and one column per vocabulary word; the column names must be the words of the vocabulary. The topics field returned by lda.collapsed.gibbs.sampler can be used directly.
num.words
For top.topic.words, the number of top words to return for each topic.
document_sums
For top.topic.documents, a K x D matrix of per-document topic counts with one row per topic and one column per document. The document_sums field returned by lda.collapsed.gibbs.sampler can be used directly.
num.documents
For top.topic.documents, the number of top documents to return for each topic.
by.score
If by.score is set to TRUE, words in each topic are ranked by a score that downweights words which are common across all topics, rather than by their raw weight within the topic (see the sketch after this table).
alpha
The scalar value of the Dirichlet hyperparameter for topic proportions.
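The by.score ranking can be pictured with a short sketch. The code below is illustrative only, not the package's exact implementation: it assumes topics is a K x V count matrix with vocabulary words as column names, and scores each word by its within-topic probability times how far its log-probability sits above the average log-probability of that word across topics, which pushes down words that are frequent in every topic.
## Illustrative only: rank words in each topic by a score that penalizes
## words which are common to all topics (names and constants are assumptions).
rank.words.by.score <- function(topics, num.words = 20) {
  probs <- topics / (rowSums(topics) + 1e-5)   ## p(word | topic), row-normalized
  log.probs <- log(probs + 1e-5)               ## guard against log(0)
  ## score[k, w] = p(w | k) * (log p(w | k) - mean over topics of log p(w | .))
  scores <- probs * sweep(log.probs, 2, colMeans(log.probs), "-")
  ## For each topic (row), return the num.words highest-scoring words.
  apply(scores, 1, function(x) colnames(topics)[order(x, decreasing = TRUE)[1:num.words]])
}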
Value
For top.topic.words, a num.words x K character matrix where each column contains the top words for that topic.
For top.topic.documents, a num.documents x K integer matrix where each column contains the top documents for that topic. The entries in the matrix are column-indexed references into document_sums.
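To make the shapes concrete, here is a minimal sketch; fit stands for any list returned by lda.collapsed.gibbs.sampler, and the names fit and my.documents are illustrative placeholders only.
## Illustrative only: the shapes of the two return values.
words <- top.topic.words(fit$topics, num.words = 10)   ## 10 x K character matrix
docs  <- top.topic.documents(fit$document_sums, 5)     ## 5 x K integer matrix
## docs[, k] holds column positions in fit$document_sums, i.e. positions of
## documents in the corpus the model was fitted on; for example:
top.docs.topic.2 <- my.documents[docs[, 2]]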
Author(s)
Jonathan Chang (slycoder@gmail.com)
References
Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
See Also
lda.collapsed.gibbs.sampler for the format of topics and document_sums.
predictive.distribution demonstrates another use for a fitted topic matrix.
Examples
## From demo(lda).
data(cora.documents)
data(cora.vocab)
K <- 10 ## Num clusters
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,    ## Num clusters
                                      cora.vocab,
                                      25,   ## Num iterations
                                      0.1,  ## alpha: Dirichlet prior on topic proportions
                                      0.1)  ## eta: Dirichlet prior on topic-word distributions
## Get the top words in each cluster
top.words <- top.topic.words(result$topics, 5, by.score=TRUE)
## top.words:
## [,1] [,2] [,3] [,4] [,5]
## [1,] "decision" "network" "planning" "learning" "design"
## [2,] "learning" "time" "visual" "networks" "logic"
## [3,] "tree" "networks" "model" "neural" "search"
## [4,] "trees" "algorithm" "memory" "system" "learning"
## [5,] "classification" "data" "system" "reinforcement" "systems"
## [,6] [,7] [,8] [,9] [,10]
## [1,] "learning" "models" "belief" "genetic" "research"
## [2,] "search" "networks" "model" "search" "reasoning"
## [3,] "crossover" "bayesian" "theory" "optimization" "grant"
## [4,] "algorithm" "data" "distribution" "evolutionary" "science"
## [5,] "complexity" "hidden" "markov" "function" "supported"