kRp.corpus,-class {tm.plugin.koRpus} | R Documentation |
S4 Class kRp.corpus
Description
Objects of this class can contain full text corpora in a hierarchical structure. The class supports both the tm package's Corpus class and koRpus' own object classes and stores them in separate slots.
Details
Objects should be created using the readCorpus function.
Slots
lang
A character string, naming the language that is assumed for the tokenized texts in this object.
desc
A named list of descriptive statistics of the tagged texts.
meta
A named list. Can be used to store meta information. Currently, no particular format is defined.
raw
A list of objects of class Corpus.
tokens
A data frame as used for the tokens slot in objects of class kRp.text. In addition to the columns usually found in those objects, this data frame also has a factor column for each hierarchical category defined (if any).
features
A named logical vector, indicating which features are available in this object's feat_list slot. Common features are listed in the description of the feat_list slot.
feat_list
A named list with optional analysis results or other content as used by the defined features:
    hierarchy: A named list of named character vectors describing the directory hierarchy level by level.
    hyphen: A named list of objects of class kRp.hyphen.
    readability: A named list of objects of class kRp.readability.
    lex_div: A named list of objects of class kRp.TTR.
    freq: The freq.analysis slot of a kRp.txt.freq class object after freq.analysis was called.
    corp_freq: An object of class kRp.corp.freq, e.g., results of a call to read.corp.custom.
    diff: A named list of diff features of a kRp.text object after a method like textTransform was called.
    summary: A summary data frame for the full corpus, including descriptive statistics on all texts, as well as results of analyses like readability and lexical diversity, if available.
    doc_term_matrix: A sparse document-term matrix, as produced by docTermMatrix.
    stopwords: A numeric vector with the total number of stopwords in each text, if stopwords were analyzed during tokenizing or POS tagging.
See the getter and setter methods for easy access to these sub-slots. There can be any number of additional features; the list above only covers those already defined by this package.
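To make the layout described above more concrete, the following base R sketch mimics two of its ideas with plain data structures: a tokens-like data frame that carries one factor column per hierarchy level, and a named logical vector that flags which entries are present in a feat_list-like list. It does not use the package at all. The column names doc_id and token, the file names, and all values are assumptions chosen for illustration, not the package's internal representation.

```r
# Hypothetical hierarchy in the shape described for the `hierarchy` feature:
# a named list of named character vectors, one entry per level.
hierarchy <- list(
  Topic = c(Winner = "Reality Winner", Edwards = "Natalie Edwards"),
  Source = c(Wikipedia_prev = "Wikipedia (old)", Wikipedia_new = "Wikipedia (new)")
)

# Toy tokens-like data frame with one factor column per hierarchy level.
# Column names and contents are made up for this illustration.
tokens <- data.frame(
  doc_id = c("a.txt", "a.txt", "b.txt"),
  token = c("Hello", "world", "Goodbye"),
  Topic = factor(
    c("Winner", "Winner", "Edwards"),
    levels = names(hierarchy$Topic)
  ),
  Source = factor(
    c("Wikipedia_prev", "Wikipedia_prev", "Wikipedia_new"),
    levels = names(hierarchy$Source)
  ),
  stringsAsFactors = FALSE
)

# A feat_list-like list and a named logical vector flagging its contents.
feat_list <- list(hierarchy = hierarchy)
known <- c("hierarchy", "readability", "lex_div")
features <- setNames(known %in% names(feat_list), known)

# The factor columns make it easy to count tokens per hierarchy category.
tokens_per_topic <- table(tokens$Topic)
print(features)
print(tokens_per_topic)
```

Keeping the logical vector separate from the list means a check for an analysis result only needs a lookup by name, without inspecting the list contents.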
Constructor function
Should you need to manually generate objects of this class (which should rarely be the case), the constructor function kRp.corpus(...) can be used instead of new("kRp.corpus", ...). Whenever possible, stick to readCorpus.
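As a minimal sketch of what a manually constructed object looks like, the snippet below creates an empty object with the constructor and inspects it using only generic tools from the methods package. It assumes that tm.plugin.koRpus is installed and is skipped otherwise. Direct slot access is shown for inspection only; the getter and setter methods remain the preferable interface.

```r
# Only run when tm.plugin.koRpus is installed.
if (requireNamespace("tm.plugin.koRpus", quietly = TRUE)) {
  # Create an empty object via the constructor function.
  emptyCorpus <- tm.plugin.koRpus::kRp.corpus()

  # Confirm the class and list the slots defined for it.
  print(methods::is(emptyCorpus, "kRp.corpus"))
  print(methods::slotNames(emptyCorpus))

  # Direct slot access, for inspection only.
  print(methods::slot(emptyCorpus, "lang"))
}
```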
Note
There are also getter and setter methods for objects of this class.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=list(
      Topic=c(
        Winner="Reality Winner",
        Edwards="Natalie Edwards"
      ),
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    # use tokenize() so examples run without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )
} else {}
# manual creation
emptyCorpus <- kRp.corpus()