R: Function to validate the fit of the LDA model

intruderTopics {tosca}

R Documentation

Function to validate the fit of the LDA model

Description

This function validates an LDA result by presenting a mix of topics and intruder topics to a human user, who has to identify the intruders.

Usage

intruderTopics(
  text = NULL,
  beta = NULL,
  theta = NULL,
  id = NULL,
  numIntruder = 1,
  numOuttopics = 4,
  byScore = TRUE,
  minWords = 0L,
  minOuttopics = 0L,
  stopTopics = NULL,
  printSolution = FALSE,
  oldResult = NULL,
  test = FALSE,
  testinput = NULL
)

Arguments

`text`	A list of texts (e.g. the text element of a `textmeta` object).
`beta`	A matrix of word-probabilities or frequency table for the topics (e.g. the `topics` matrix from the `LDAgen` result). Each row is a topic, each column a word. The rows will be divided by the row sums, if they are not 1.
`theta`	A matrix of word counts per text and topic (e.g. the `document_sums` matrix from the `LDAgen` result). Each row is a topic, each column a text. Cell (i, j) holds the number of words in text j that belong to topic i.
`id`	Optional: character vector of text IDs that should be used for the function. Useful for continuing an unfinished coding task.
`numIntruder`	Intended number of intruder topics. If `numIntruder` is an integer vector, the number is sampled from it for each text.
`numOuttopics`	Integer: Number of topics presented per text, including the intruder topics.
`byScore`	Logical: Should the score of `top.topic.words` from the `lda` package be used?
`minWords`	Integer: Minimum number of words for a chosen text.
`minOuttopics`	Integer: Minimum number of words a topic needs to be classified as a possible correct topic.
`stopTopics`	Optional: Integer vector to deselect stopword topics for the coding task.
`printSolution`	Logical: If `TRUE`, the coder gets feedback after each vote.
`oldResult`	Result object from an unfinished run of `intruderTopics`. If `oldResult` is used, all other parameters are ignored.
`test`	Logical: Enables test mode
`testinput`	Input for function tests
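
The layout expected for `beta` and `theta` can be illustrated with small toy matrices. The following is a minimal sketch in base R and is not part of tosca; all object names and values are made up for illustration. It mirrors the row normalization described for `beta` and shows how a single `theta` cell is read.

```r
# Toy inputs (hypothetical values): 2 topics, 3 words, 2 texts.
beta_counts <- matrix(c(4, 1, 5,
                        2, 6, 2),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(NULL, c("tax", "vote", "law")))

# Rows that do not sum to 1 are divided by their row sums,
# so that each row becomes a probability distribution over words.
beta_prob <- beta_counts / rowSums(beta_counts)
rowSums(beta_prob)

# theta: rows are topics, columns are texts.
theta_toy <- matrix(c(12, 3,
                       0, 9),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(NULL, c("textA", "textB")))

# Number of words in text "textB" that belong to topic 1.
theta_toy[1, "textB"]
```

In practice both matrices would come from the `LDAgen` result (`topics` and `document_sums`), so no manual construction is needed.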

Value

Object of class IntruderTopics. List of 11

`result`	Matrix of 3 columns. Each row represents one labeled text. `numIntruder` (1st column) gives the number of intruder topics inserted in this text, `missIntruder` (2nd column) the number of intruder topics that were not found by the coder, and `falseIntruder` (3rd column) the number of topics chosen by the coder that were not intruders.
`beta`	Parameter of the function call
`theta`	Parameter of the function call
`id`	Character vector of IDs at the beginning
`byScore`	Parameter of the function call
`numIntruder`	Parameter of the function call
`numOuttopics`	Parameter of the function call
`minWords`	Parameter of the function call
`minOuttopics`	Parameter of the function call
`unusedID`	Character vector of unused text IDs for the next run
`stopTopics`	Parameter of the function call
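
As an illustration of how the `result` matrix might be summarised after a coding session, here is a small sketch in base R. The matrix values are invented, and the two summary measures (share of intruders found, false picks per text) are assumptions chosen for illustration rather than something tosca computes.

```r
# Hypothetical result matrix with the three documented columns;
# each row is one labeled text.
res <- matrix(c(1, 0, 0,
                1, 1, 1,
                2, 1, 0,
                1, 0, 1),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL,
                              c("numIntruder", "missIntruder", "falseIntruder")))

# Share of inserted intruder topics that the coder found.
found_share <- 1 - sum(res[, "missIntruder"]) / sum(res[, "numIntruder"])

# Average number of wrongly selected topics per labeled text.
false_per_text <- mean(res[, "falseIntruder"])

found_share
false_per_text
```

With a real object, `res` would be the `result` element of the returned list. The column names used here are assumed from the description above; indexing by position (columns 1 to 3) avoids relying on them.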

References

Chang, Jonathan, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber and David M. Blei (2009). Reading Tea Leaves: How Humans Interpret Topic Models. Advances in Neural Information Processing Systems.

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
intruder <- intruderTopics(text=politics$text, beta=LDAresult$topics,
                           theta=LDAresult$document_sums, id=names(poliLDA))

## End(Not run)

[Package tosca version 0.3-2 Index]