SimpleCorpus {tm} | R Documentation
Simple Corpora
Description
Create simple corpora.
Usage
SimpleCorpus(x, control = list(language = "en"))
Arguments
x | a Source object.
control | a named list of control parameters.
Details
A simple corpus is kept fully in memory. Compared to a VCorpus, it is
optimized for the most common usage scenario: importing plain texts from
files in a directory or directly from a vector in R, preprocessing and
transforming the texts, and finally exporting them to a term-document matrix.
It adheres to the Corpus API. However, it internally takes various shortcuts
to boost performance and minimize memory pressure; consequently, it operates
only under the following constraints:
- only DataframeSource, DirSource and VectorSource are supported,
- no custom readers, i.e., each document is read in and stored as plain text (as a string, i.e., a character vector of length one),
- transformations applied via tm_map must be able to process character vectors and return character vectors (of the same length),
- no lazy transformations in tm_map,
- no meta data for individual documents (i.e., no "local" in meta).
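The transformation constraint can be illustrated with a minimal sketch. It assumes the tm package is installed; the two sample strings and the names doc1 and doc2 are made up for illustration. Each step takes a character vector and returns one of the same length, which is what a simple corpus requires of tm_map.

```r
library(tm)

# Hypothetical sample texts; any character vector works with VectorSource.
docs <- c(doc1 = "Hello, World!", doc2 = "Simple   corpora are FAST.")
corp <- SimpleCorpus(VectorSource(docs), control = list(language = "en"))

# Each transformation maps a character vector to a character vector
# of the same length, so it is usable on a simple corpus.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)

# The content is a plain character vector, one string per document.
content(corp)
```

A transformation that returns a vector of a different length, or that expects document objects rather than strings, falls outside these constraints; a VCorpus may be the better fit in that case.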
Value
An object inheriting from SimpleCorpus and Corpus.
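As a quick check of the returned value, the following sketch (assuming tm is installed, with two made-up sample strings) inspects the class of a simple corpus and the type of its content.

```r
library(tm)

# Hypothetical sample texts used only for illustration.
corp <- SimpleCorpus(VectorSource(c("first text", "second text")))

# The object inherits from both classes named above.
inherits(corp, "SimpleCorpus")
inherits(corp, "Corpus")

# Documents are stored as plain strings in a character vector.
is.character(content(corp))
```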
See Also
Corpus for basic information on the corpus infrastructure employed by
package tm.
VCorpus provides an implementation with volatile storage semantics, and
PCorpus provides an implementation with permanent storage semantics.
Examples
txt <- system.file("texts", "txt", package = "tm")
(ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"),
control = list(language = "lat")))
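The workflow described in Details (import, preprocess, export to a term-document matrix) can be sketched end to end on the same sample texts. This is an illustrative sketch, not part of the original example; the choice of preprocessing steps is an assumption.

```r
library(tm)

# Import: read the sample texts shipped with tm from a directory.
txt <- system.file("texts", "txt", package = "tm")
ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"),
                     control = list(language = "lat"))

# Preprocess: character-vector transformations only (illustrative choice).
ovid <- tm_map(ovid, content_transformer(tolower))
ovid <- tm_map(ovid, removePunctuation)

# Export: build a document-term matrix, one row per document.
dtm <- DocumentTermMatrix(ovid)
dim(dtm)
```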