R: Extract the most representative words from topics

topWords {sentopics}

R Documentation

Extract the most representative words from topics

Description

Extract the top words in each topic/sentiment from a sentopicmodel.

Usage

topWords(
  x,
  nWords = 10,
  method = c("frequency", "probability", "term-score", "FREX"),
  output = c("data.frame", "plot", "matrix"),
  subset,
  w = 0.5
)

plot_topWords(
  x,
  nWords = 10,
  method = c("frequency", "probability", "term-score", "FREX"),
  subset,
  w = 0.5
)

Arguments

`x`	a `sentopicmodel` created by `LDA()`, `JST()` or `rJST()`
`nWords`	the number of top words to extract
`method`	specifies whether a re-ranking function should be applied before returning the top words. See Details for a description of each method.
`output`	determines the output format: a data.frame, a plot or a matrix
`subset`	allows subsetting with a logical expression, as in `subset()`. Particularly useful to limit the number of observations on plot outputs. The logical expression uses topic and sentiment indices rather than their labels. It is possible to subset on both topic and sentiment by adding a `&` operator between two expressions.
`w`	only used when `method = "FREX"`. Determines the weight assigned to the exclusivity score at the expense of the frequency score.

Details

"frequency" ranks top words according to their frequency within a topic. This method also reports the overall frequency of each word. When returning a plot, the overall frequency is represented with a grey bar.

"probability" uses the estimated topic-word mixture \phi to rank top words.

"term-score" implements the re-ranking method from Blei and Lafferty (2009). This method down-weights terms that have high probability in all topics using the following score:

\text{term-score}_{k,v} = \phi_{k, v}\log\left(\frac{\phi_{k, v}}{\left(\prod^K_{j=1}\phi_{j,v}\right)^{\frac{1}{K}}}\right),

for topic k, vocabulary word v and number of topics K.
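
As a rough illustration of the term-score formula, the base R sketch below applies it to a made-up topic-word matrix. The matrix `phi`, its word labels and the object names are assumptions for this example only; this is not the code used internally by sentopics.

```r
# Toy topic-word matrix phi: K = 3 topics (rows), V = 4 words (columns).
# The values are invented for illustration; each row sums to 1.
phi <- matrix(
  c(0.45, 0.05, 0.10, 0.40,
    0.05, 0.45, 0.10, 0.40,
    0.10, 0.10, 0.40, 0.40),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("topic", 1:3),
                  c("rate", "inflation", "growth", "the"))
)

# Log of the geometric mean of each word's probability across the K topics
log_geo_mean <- colMeans(log(phi))

# term-score_{k,v} = phi_{k,v} * (log(phi_{k,v}) - log geometric mean of word v)
term_score <- phi * sweep(log(phi), 2, log_geo_mean)

# Words of each topic ordered by decreasing term-score (one column per topic)
top_words <- apply(term_score, 1, function(s) {
  colnames(phi)[order(s, decreasing = TRUE)]
})
top_words
```

In this toy matrix the word "the" has the same probability in every topic, so its term-score is zero everywhere, while each topic's characteristic word receives a positive score.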

"FREX" implements the re-ranking method from Bischof and Airoldi (2012). This method uses the weight w to balance between topic-word probability and topic exclusivity using the following score:

\text{FREX}_{k,v}=\left(\frac{w}{\text{ECDF}\left( \frac{\phi_{k,v}}{\sum_{j=1}^K\phi_{j,v}}\right)} + \frac{1-w}{\text{ECDF}\left(\phi_{k,v}\right)} \right)^{-1},

for topic k, vocabulary word v, number of topics K and weight w, where \text{ECDF} is the empirical cumulative distribution function.
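
The FREX score can be sketched in base R as a weighted harmonic mean of two ECDF ranks, following Bischof and Airoldi (2012). The toy matrix `phi`, the helper `ecdf_rows()` and the choice to take the ECDF within each topic across the vocabulary are assumptions made for this example; the internal computation in sentopics may differ.

```r
# Toy topic-word matrix phi: K = 3 topics (rows), V = 4 words (columns).
# The values are invented for illustration; each row sums to 1.
phi <- matrix(
  c(0.45, 0.05, 0.10, 0.40,
    0.05, 0.45, 0.10, 0.40,
    0.10, 0.10, 0.40, 0.40),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("topic", 1:3),
                  c("rate", "inflation", "growth", "the"))
)

# Exclusivity: share of word v's total probability that belongs to topic k
excl <- sweep(phi, 2, colSums(phi), "/")

# Hypothetical helper: ECDF of each value among the values of its own row.
# Assumption: the ECDF is taken within a topic, across the vocabulary.
ecdf_rows <- function(m) {
  out <- t(apply(m, 1, function(r) ecdf(r)(r)))
  dimnames(out) <- dimnames(m)
  out
}

w <- 0.5  # weight given to exclusivity

# Weighted harmonic mean of the exclusivity rank and the probability rank
frex <- 1 / (w / ecdf_rows(excl) + (1 - w) / ecdf_rows(phi))
round(frex, 3)
```

In topic 3 of this toy matrix, "growth" and "the" have the same probability, but "growth" is more exclusive to the topic and therefore obtains the higher FREX score.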

Value

The top words of the topic model. Depending on the `output` chosen, the result is a long-format data.frame, a ggplot2 object or a matrix.

Author(s)

Olivier Delmarcelle

References

Blei, DM. and Lafferty, JD. (2009). Topic models. In Text Mining, chapter 4, 101–124.

Bischof, JM. and Airoldi, EM. (2012). Summarizing Topical Content with Word Frequency and Exclusivity. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, 9–16.

Examples

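model <- LDA(ECB_press_conferences_tokens)
model <- fit(model, 10)
# Default: top words by frequency, returned as a data.frame
topWords(model)
# Same top words, returned as a matrix
topWords(model, output = "matrix")
# Re-rank the words with the FREX score
topWords(model, method = "FREX")
plot_topWords(model)
# Restrict the plot to topics 1 and 2 (indices, not labels)
plot_topWords(model, subset = topic %in% 1:2)

jst <- JST(ECB_press_conferences_tokens)
jst <- fit(jst, 10)
plot_topWords(jst)
# Subset on both topic and sentiment indices with `&`
plot_topWords(jst, subset = topic %in% 1:2 & sentiment == 3)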

[Package sentopics version 0.7.3 Index]