readCorpus {stm} | R Documentation |
Read in a corpus file.
Description
Converts pre-processed document matrices stored in popular formats to stm format.
Usage
readCorpus(corpus, type = c("dtm", "slam", "Matrix"))
Arguments
corpus |
An input file or filepath to be processed |
type |
The type of input file. We offer several sources, see details. |
Details
This function provides a simple utility for converting other document formats to our own. Briefly, dtm takes as input a standard matrix and converts it to our format. slam converts from the simple_triplet_matrix representation used by the slam package. This is also the representation of corpora in the popular tm package and should work in those cases.
dtm expects a matrix object where each row represents a document and each column represents a word in the dictionary.
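As a minimal sketch of the dtm case, the snippet below builds a small dense count matrix and passes it to readCorpus. The matrix contents, the document names, and the term names are made up for illustration and are not part of the package.

```r
library(stm)

# Hypothetical toy document-term matrix: 3 documents (rows) by 4 terms (columns).
counts <- matrix(
  c(2, 0, 1, 0,
    0, 3, 1, 1,
    0, 0, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("tax", "vote", "bill", "senate"))
)

# Convert the dense matrix to stm format.
out <- readCorpus(counts, type = "dtm")
documents <- out$documents
vocab <- out$vocab
```

Here the column names of the matrix are the natural source for the vocabulary, so it is worth inspecting out$vocab after the call to confirm it matches what you expect.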
slam expects a simple_triplet_matrix from that package.
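The slam case might look like the following sketch, which builds a simple_triplet_matrix by hand with slam::simple_triplet_matrix. The counts are invented, and naming the dimnames Docs and Terms follows the convention used by tm; whether a vocabulary is returned for other dimnames layouts is an assumption you should check on your own data.

```r
library(stm)
library(slam)

# Hypothetical sparse counts in triplet form:
# i = document index, j = term index, v = count.
triplet_counts <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 2L, 2L, 3L),
  j = c(1L, 3L, 2L, 3L, 4L, 4L),
  v = c(2, 1, 3, 1, 1, 4),
  nrow = 3, ncol = 4,
  dimnames = list(Docs  = c("doc1", "doc2", "doc3"),
                  Terms = c("tax", "vote", "bill", "senate"))
)

# Convert the triplet matrix to stm format.
out <- readCorpus(triplet_counts, type = "slam")
documents <- out$documents
vocab <- out$vocab
```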
Matrix attempts to coerce the matrix to a simple_triplet_matrix and convert using the functionality built for the slam package. This will work for most applicable classes in the Matrix package such as dgCMatrix.
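A small sketch of the Matrix case, assuming the Matrix package is installed: Matrix::sparseMatrix with numeric values produces a dgCMatrix, which is then handed to readCorpus. The counts are invented for illustration. Because this toy matrix carries no dimnames, the returned vocab may well be empty, so check it before relying on it.

```r
library(stm)
library(Matrix)

# Hypothetical sparse counts: 3 documents (rows) by 4 terms (columns).
sparse_counts <- sparseMatrix(
  i = c(1, 1, 2, 2, 2, 3),
  j = c(1, 3, 2, 3, 4, 4),
  x = c(2, 1, 3, 1, 1, 4),
  dims = c(3, 4)
)

# Coerce via the slam representation and convert to stm format.
out <- readCorpus(sparse_counts, type = "Matrix")
documents <- out$documents
```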
If you are trying to read a .ldac file, see readLdac.
Value
documents |
A documents object in our format |
vocab |
A vocab object if information is available to construct one |
See Also
textProcessor, prepDocuments, readLdac
Examples
## Not run:
library(textir)
data(congress109)
out <- readCorpus(congress109Counts, type="Matrix")
documents <- out$documents
vocab <- out$vocab
## End(Not run)