newsgroups {lda} | R Documentation |
A collection of newsgroup messages with classes.
Description
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
Usage
data(newsgroup.train.documents)
data(newsgroup.test.documents)
data(newsgroup.train.labels)
data(newsgroup.test.labels)
data(newsgroup.vocab)
data(newsgroup.label.map)
Format
newsgroup.train.documents
and newsgroup.test.documents
comprise a corpus of 20,000 newsgroup documents conforming to the LDA format,
partitioned into 11269 training and 7505 training and test cases evenly distributed
across 20 classes.
newsgroup.train.labels
is a numeric vector of length 11269 which gives
a class label from 1 to 20 for each training document in the corpus.
newsgroup.test.labels
is a numeric vector of length 7505 which gives
a class label from 1 to 20 for each training document in the corpus.
newsgroup.vocab
is the vocabulary of the corpus.
newsgroup.label.map
maps the numeric class labels to actual class names.
Source
http://qwone.com/~jason/20Newsgroups/
See Also
lda.collapsed.gibbs.sampler
for the format of the
corpus.
Examples
data(newsgroup.train.documents)
data(newsgroup.test.documents)
data(newsgroup.train.labels)
data(newsgroup.test.labels)
data(newsgroup.vocab)
data(newsgroup.label.map)