DistributedCorpus {tm.plugin.dc} | R Documentation
Distributed Corpus
Description
Data structures and operators for distributed corpora.
Usage
DCorpus(x,
        readerControl = list(reader = reader(x),
                             language = "en"),
        storage = NULL, keep = TRUE, ...)
## S3 method for class 'DCorpus'
as.VCorpus(x)
as.DCorpus(x, storage = NULL, ...)
Arguments
x |
for |
readerControl |
A list with the named components |
storage |
The storage subsystem to use with the DCorpus. Two storage types are currently supported: local disk storage using the Local File System (LFS) and the Hadoop Distributed File System (HDFS). Default: 'LFS'. |
keep |
Should revisions be used when operating on the
|
... |
Optional arguments for the |
Details
When constructing a distributed corpus, the input source is
extracted via the supplied reader and stored on the given file
system (argument storage). While the data set resides on the
corresponding storage (e.g., HDFS), only a symbolic representation is
held in R (a so-called DList), which allows the corpus to be
accessed via the corresponding DList methods. Since the memory
available to a distributed corpus is restricted only by the
available disk space in the given storage (and not by main memory, as in
a standard tm corpus), a set of so-called revisions, i.e., stages of
the (processed) corpus, is also stored by default. Revisions
can be turned off later on using the keepRevisions()
replacement function.
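The idea of keeping the data on a storage and only a lightweight handle in R, with each processing step written out as a new revision, can be illustrated in base R. This is a conceptual sketch only, not the implementation used by tm.plugin.dc or by DList: the helper names store_revision(), read_revision() and transform_revision(), the "rev-NNN" directory layout, and the use of a temporary directory as the storage are all assumptions made for illustration.

```r
## Conceptual sketch in base R; NOT the tm.plugin.dc implementation.
## All helper names and the directory layout below are hypothetical.

## A directory standing in for the storage (e.g., a base directory on LFS).
base_dir <- file.path(tempdir(), "dc_sketch")
dir.create(base_dir, showWarnings = FALSE, recursive = TRUE)

## Write a named list of documents to a new revision directory and return
## only a lightweight handle (revision number and paths), not the texts.
store_revision <- function(docs, base_dir, revision) {
    rev_dir <- file.path(base_dir, sprintf("rev-%03d", revision))
    dir.create(rev_dir, showWarnings = FALSE)
    paths <- file.path(rev_dir, paste0(names(docs), ".txt"))
    names(paths) <- names(docs)
    for (i in seq_along(docs))
        writeLines(docs[[i]], paths[i])
    list(revision = revision, paths = paths)
}

## Read the documents of a revision back into memory on demand.
read_revision <- function(handle)
    lapply(handle$paths, function(p) paste(readLines(p), collapse = "\n"))

## Apply a transformation and store the result as the next revision.
## With keep = FALSE the previous revision is removed from the storage.
transform_revision <- function(handle, FUN, base_dir, keep = TRUE) {
    docs <- lapply(read_revision(handle), FUN)
    new_handle <- store_revision(docs, base_dir, handle$revision + 1L)
    if (!keep)
        unlink(dirname(handle$paths[1]), recursive = TRUE)
    new_handle
}

docs <- list(doc1 = "Crude OIL prices rose.", doc2 = "OPEC met in Vienna.")
rev1 <- store_revision(docs, base_dir, 1L)
rev2 <- transform_revision(rev1, tolower, base_dir, keep = TRUE)

read_revision(rev2)$doc1   # processed text of the latest revision
file.exists(rev1$paths)    # the earlier revision is still on the storage
```

In this sketch, calling transform_revision() with keep = FALSE deletes the previous revision once the new one is written, which loosely mirrors the effect of turning revisions off: disk usage stays bounded, but earlier stages of the corpus can no longer be recovered.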
The constructed corpus object inherits from a tm Corpus and has
several slots containing meta information:
meta
Corpus Meta Data contains corpus specific meta data in the form of tag-value pairs.
dmeta
Document Meta Data of class data.frame contains document specific meta data for the corpus. This is mainly available for compatibility with standard tm corpus definitions but is not yet actually used in the distributed scenario.
keep
A logical indicating whether revisions representing stages, e.g., in a preprocessing chain, should be kept or not.
Value
An object inheriting from DCorpus and Corpus.
Author(s)
Ingo Feinerer and Stefan Theussl
See Also
Corpus for basic information on the corpus infrastructure
employed by package tm.
Examples
## Similar to example in package 'tm'
reut21578 <- system.file("texts", "crude", package = "tm")
dc <- DistributedCorpus(DirSource(reut21578),
                        readerControl = list(reader = readReut21578XMLasPlain))
dc
## Coercion
data("crude")
as.DistributedCorpus(crude)
as.VCorpus(dc)