intruderTopics {tosca}    R Documentation

Function to validate the fit of the LDA model

Description

This function validates an LDA result by presenting a mix of topics and intruder topics to a human user, who has to identify the intruders.

Usage

intruderTopics(
  text = NULL,
  beta = NULL,
  theta = NULL,
  id = NULL,
  numIntruder = 1,
  numOuttopics = 4,
  byScore = TRUE,
  minWords = 0L,
  minOuttopics = 0L,
  stopTopics = NULL,
  printSolution = FALSE,
  oldResult = NULL,
  test = FALSE,
  testinput = NULL
)

Arguments

text

A list of texts (e.g. the text element of a textmeta object).

beta

A matrix of word-probabilities or frequency table for the topics (e.g. the topics matrix from the LDAgen result). Each row is a topic, each column a word. The rows will be divided by the row sums, if they are not 1.

theta

A matrix of word counts per text and topic (e.g. the document_sums matrix from the LDAgen result). Each row is a topic, each column a text. Cell (i, j) holds the number of words in text j that belong to topic i.

id

Optional: character vector of text IDs that should be used for the function. Useful for resuming an unfinished coding task.

numIntruder

Intended number of intruder topics. If numIntruder is an integer vector, the number is sampled from it for each text.

numOuttopics

Integer: Number of topics shown per text, including the intruder topics.

byScore

Logical: Should the score of top.topic.words from the lda package be used?

minWords

Integer: Minimum number of words for a chosen text.

minOuttopics

Integer: Minimum number of words a topic needs to qualify as a possible correct topic.

stopTopics

Optional: Integer vector to deselect stopword topics for the coding task.

printSolution

Logical: If TRUE, the coder receives feedback after each vote.

oldResult

Result object from an unfinished run of intruderTopics. If oldResult is used, all other parameters are ignored.

test

Logical: Enables test mode.

testinput

Input for function tests.
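
The matrix layouts expected for beta and theta can be illustrated with a small base R sketch. The values below are invented toy data, not output of LDAgen, and the normalization step is only assumed to mirror the row-sum division described for beta.

```r
# Toy frequency table: 2 topics (rows) x 3 words (columns); values are invented.
beta <- matrix(c(4, 1, 5,
                 2, 2, 6),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("tax", "vote", "law")))

# Divide each row by its row sum so that every topic is a probability
# distribution over the words.
betaProb <- beta / rowSums(beta)
rowSums(betaProb)

# Toy word counts: 2 topics (rows) x 3 texts (columns); values are invented.
theta <- matrix(c(10, 0, 3,
                   2, 8, 3),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("ID1", "ID2", "ID3")))

# theta[i, j] is the number of words in text j assigned to topic i,
# so the column sums give the number of assigned words per text.
wordsPerText <- colSums(theta)
wordsPerText
```

In a real session, beta and theta would be taken from the LDAgen result (its topics and document_sums elements) rather than built by hand.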

Value

Object of class IntruderTopics: a list of 11 elements.

result

Matrix of 3 columns. Each row represents one labeled text. numIntruder (first column) gives the number of intruder topics inserted into this text, missIntruder (second column) the number of intruder topics that the coder did not find, and falseIntruder (third column) the number of topics chosen by the coder that were not intruders.

beta

Parameter of the function call

theta

Parameter of the function call

id

Character vector of IDs at the beginning

byScore

Parameter of the function call

numIntruder

Parameter of the function call

numOuttopics

Parameter of the function call

minWords

Parameter of the function call

minOuttopics

Parameter of the function call

unusedID

Character vector of unused text IDs for the next run

stopTopics

Parameter of the function call
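
The result matrix can be condensed into a few summary figures. The following is a minimal sketch on a hand-made matrix; the values are invented, the column names are assumed to match the names given in the description of result, and hitRate and perfect are hypothetical helper quantities, not part of the package.

```r
# Hypothetical result matrix with the three documented columns; values are invented.
result <- matrix(c(1, 0, 0,
                   1, 1, 1,
                   2, 1, 0),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("ID1", "ID2", "ID3"),
                                 c("numIntruder", "missIntruder", "falseIntruder")))

# Share of inserted intruder topics that the coder found.
found <- sum(result[, "numIntruder"] - result[, "missIntruder"])
hitRate <- found / sum(result[, "numIntruder"])

# Number of texts coded without any missed or falsely chosen topic.
perfect <- sum(result[, "missIntruder"] == 0 & result[, "falseIntruder"] == 0)

hitRate
perfect
```

With an actual return value, the matrix would be taken from the result element of the returned list instead of being constructed manually.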

References

Chang, Jonathan, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber and David M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. Advances in Neural Information Processing Systems, 2009.

Examples

## Not run: 
data(politics)
poliClean <- cleanTexts(politics)
words10 <- makeWordlist(text=poliClean$text)
words10 <- words10$words[words10$wordtable > 10]
poliLDA <- LDAprep(text=poliClean$text, vocab=words10)
LDAresult <- LDAgen(documents=poliLDA, K=10, vocab=words10)
intruder <- intruderTopics(text=politics$text, beta=LDAresult$topics,
                           theta=LDAresult$document_sums, id=names(poliLDA))

## End(Not run)

[Package tosca version 0.3-2 Index]