convertCorpus {stm} | R Documentation |
Convert stm formatted documents to another format
Description
Takes stm-formatted documents and vocab objects and returns them in formats usable by other packages.
Usage
convertCorpus(documents, vocab, type = c("slam", "lda", "Matrix"))
Arguments
documents |
the documents object in stm format |
vocab |
the vocab object in stm format |
type |
the output type desired. See Details. |
Details
We also recommend the quanteda and tm packages for text preparation and related tasks. The convertCorpus function is provided as a convenience for converting between formats, but if you intend to do text processing with a variety of output formats, you will likely want to start with quanteda or tm.
The various type conversions are described below:
type = "slam"
Converts to the simple triplet matrix representation used by the slam package. This is the format used internally by tm.
type = "lda"
Converts to the format used by the lda package. This is a very minor change, as the format in stm is based on lda's data representation. The difference, as noted in stm, involves how the numbers are indexed: stm indexes the vocabulary from one, whereas lda indexes from zero. Accordingly, this type returns a list containing the new documents object and the unchanged vocab object.
type = "Matrix"
Converts to the sparse matrix representation used by Matrix. This is the format used internally by numerous other text analysis packages.
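As a rough illustration of what a sparse document-term representation looks like, the sketch below builds one by hand from a toy corpus. The objects docs and vocab are hypothetical, and the code is an assumption about the general idea, not the package's actual implementation; it uses only sparseMatrix from the Matrix package.

```r
library(Matrix)

# Hypothetical toy corpus in stm format: one 2-row integer matrix per document.
# Row 1 holds 1-based vocab indices, row 2 holds the matching term counts.
docs <- list(
  rbind(c(1L, 3L), c(2L, 1L)),  # "apple" twice, "cherry" once
  rbind(c(2L, 3L), c(1L, 4L))   # "banana" once, "cherry" four times
)
vocab <- c("apple", "banana", "cherry")

# Flatten the corpus into (row, column, value) triplets.
i <- rep(seq_along(docs), times = vapply(docs, ncol, integer(1)))
j <- unlist(lapply(docs, function(d) d[1, ]))
x <- as.numeric(unlist(lapply(docs, function(d) d[2, ])))

# Documents become rows and vocabulary terms become columns.
dtm <- sparseMatrix(i = i, j = j, x = x,
                    dims = c(length(docs), length(vocab)),
                    dimnames = list(NULL, vocab))
dtm
```

Only the nonzero counts are stored, which is why this layout suits large, mostly empty document-term matrices.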
If you want to write out a file containing the sparse matrix representation popularized by David Blei's C code ldac, see the function writeLdac.
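The indexing difference described under type = "lda" can be sketched in base R. The toy docs and vocab objects are hypothetical, and the element names of the result list are an assumption for illustration; this shows the idea of shifting one-based stm indices to the zero-based indices used by lda, not the package's own code.

```r
# Hypothetical toy corpus in stm format: row 1 = 1-based vocab index,
# row 2 = count of that term in the document.
docs <- list(
  rbind(c(1L, 3L), c(2L, 1L)),
  rbind(c(2L, 3L), c(1L, 4L))
)
vocab <- c("apple", "banana", "cherry")

# Shift only the index row down by one; the counts stay untouched.
lda_docs <- lapply(docs, function(d) {
  d[1, ] <- d[1, ] - 1L
  d
})

# Bundle the shifted documents with the unchanged vocab
# (list element names are assumed here).
result <- list(documents = lda_docs, vocab = vocab)

# Looking up words from zero-based indices needs the offset added back.
vocab[result$documents[[1]][1, ] + 1L]
```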
See Also
writeLdac
readCorpus
poliblog5k
Examples
#convert the poliblog5k data to slam package format
poliSlam <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type="slam")
class(poliSlam)
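#convert to the sparse matrix format of the Matrix package
poliMatrix <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type="Matrix")
class(poliMatrix)
#convert to lda package format; the result is a list
poliLDA <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type="lda")
str(poliLDA)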