| readCorpus {stm} | R Documentation |
Read in a corpus file.
Description
Converts pre-processed document matrices stored in popular formats to stm format.
Usage
readCorpus(corpus, type = c("dtm", "slam", "Matrix"))
Arguments
corpus |
An input file or filepath to be processed |
type |
The type of input file. Several sources are supported; see Details. |
Details
This function provides a simple utility for converting other document
formats to our own. Briefly: dtm takes a standard matrix as input
and converts it to our format. slam converts from the
simple_triplet_matrix representation used by the slam package.
The popular tm package also represents corpora this way, so its
document-term matrices should work as well.
dtm expects a matrix object where each row represents a document and
each column represents a word in the dictionary.
slam expects a simple_triplet_matrix from the slam package.
Matrix attempts to coerce the matrix to a
simple_triplet_matrix and convert using the
functionality built for the slam package. This will work for most
applicable classes in the Matrix package such as dgCMatrix.
If you are trying to read a .ldac file, see readLdac.
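To make the dtm layout concrete, the following base R sketch converts a small dense matrix (rows are documents, columns are words) into a list of two-row integer matrices, which is the document layout stm works with: the first row holds 1-based word indices and the second row holds counts. The helper dense_to_docs and the toy matrix are hypothetical names introduced only for illustration; this is not the package's own implementation.

```r
# Hypothetical helper (not part of stm): turn a dense document-term matrix
# into a list with one two-row integer matrix per document.
dense_to_docs <- function(dtm) {
  stopifnot(is.matrix(dtm))
  docs <- lapply(seq_len(nrow(dtm)), function(i) {
    counts <- dtm[i, ]
    keep <- which(counts > 0)  # 1-based column indices of words that occur
    # Row 1: word indices; row 2: how often each of those words occurs
    rbind(as.integer(keep), as.integer(counts[keep]))
  })
  list(documents = docs, vocab = colnames(dtm))
}

# Toy input: two documents over a three-word dictionary (made-up data)
dtm <- matrix(c(2, 0, 1,
                0, 3, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"),
                              c("apple", "banana", "cherry")))

sketch <- dense_to_docs(dtm)
sketch$documents[[1]]  # word indices and counts for the first document
sketch$vocab           # dictionary taken from the column names
```

Words with a zero count are dropped from each document, so the per-document matrices can have different numbers of columns.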
Value
documents |
A documents object in our format |
vocab |
A vocab object if information is available to construct one |
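As a rough illustration of the returned value, the sketch below passes a small dense matrix through readCorpus with type = "dtm" and inspects the two components. The toy matrix is invented for the example, and the block is guarded so that it only runs when stm is installed.

```r
# Sketch only: the toy matrix below is made up for illustration.
if (requireNamespace("stm", quietly = TRUE)) {
  dtm <- matrix(c(2, 0, 1,
                  0, 3, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"),
                                c("apple", "banana", "cherry")))

  out <- stm::readCorpus(dtm, type = "dtm")

  str(out$documents)  # inspect the converted documents
  out$vocab           # vocabulary, available here because dtm has column names
}
```

If the input carries no term names, there may be no information from which to build vocab, so it is worth checking that component before using it.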
See Also
textProcessor, prepDocuments, readLdac
Examples
## Not run:
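library(stm)
library(textir)
# Loads congress109Counts, the count matrix used below
data(congress109)
# Convert the Matrix-package object to stm format
out <- readCorpus(congress109Counts, type="Matrix")
documents <- out$documents  # documents in stm format
vocab <- out$vocab          # vocabulary, if available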
## End(Not run)