| kRp.corpus,-class {tm.plugin.koRpus} | R Documentation |
S4 Class kRp.corpus
Description
Objects of this class can contain full text corpora in a hierarchical structure. It supports both the tm package's
Corpus class and koRpus' own object classes and stores them in separate slots.
Details
Objects should be created using the readCorpus function.
Slots
lang: A character string, naming the language that is assumed for the tokenized texts in this object.
desc: A named list of descriptive statistics of the tagged texts.
meta: A named list. Can be used to store meta information. Currently, no particular format is defined.
raw: A list of objects of class Corpus.
tokens: A data frame as used for the tokens slot in objects of class kRp.text. In addition to the columns usually found in those objects, this data frame also has a factor column for each hierarchical category defined (if any).
features: A named logical vector, indicating which features are available in this object's feat_list slot. Common features are listed in the description of the feat_list slot.
feat_list: A named list with optional analysis results or other content as used by the defined features:
    hierarchy: A named list of named character vectors describing the directory hierarchy level by level.
    hyphen: A named list of objects of class kRp.hyphen.
    readability: A named list of objects of class kRp.readability.
    lex_div: A named list of objects of class kRp.TTR.
    freq: The freq.analysis slot of a kRp.txt.freq class object after freq.analysis was called.
    corp_freq: An object of class kRp.corp.freq, e.g., results of a call to read.corp.custom.
    diff: A named list of diff features of a kRp.text object after a method like textTransform was called.
    summary: A summary data frame for the full corpus, including descriptive statistics on all texts, as well as results of analyses like readability and lexical diversity, if available.
    doc_term_matrix: A sparse document-term matrix, as produced by docTermMatrix.
    stopwords: A numeric vector with the total number of stopwords in each text, if stopwords were analyzed during tokenizing or POS tagging.
See the getter and setter methods for easy access to these sub-slots. There can be any number of additional features; the above lists only those already defined by this package.
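The relationship between the features and feat_list slots can be illustrated with a small stand-in class. This is only a sketch under stated assumptions: toyCorpus is a hypothetical class invented for this illustration, not the real kRp.corpus definition, and the stopword counts are made up. It merely shows how a named logical vector can flag which entries of a named list are expected to be present.

```r
library(methods)

# Hypothetical stand-in class, NOT the real kRp.corpus definition;
# it only mirrors the two slots described above.
setClass(
  "toyCorpus",
  slots = c(features = "logical", feat_list = "list")
)

toy <- new(
  "toyCorpus",
  features = c(hierarchy = TRUE, stopwords = TRUE, readability = FALSE),
  feat_list = list(
    hierarchy = list(
      Topic = c(Winner = "Reality Winner", Edwards = "Natalie Edwards")
    ),
    stopwords = c(text1 = 12, text2 = 7)  # made-up counts for illustration
  )
)

# names of all features flagged as available
available <- names(which(toy@features))

# every available feature should have a matching entry in feat_list
consistent <- all(available %in% names(toy@feat_list))
```

For real kRp.corpus objects, the package's getter and setter methods are the intended way to reach these sub-slots; direct slot access as in the toy example is shown only to make the data structure visible.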
Constructor function
Should you need to manually generate objects of this class (which should rarely be the case),
the constructor function kRp.corpus(...) can be used instead of new("kRp.corpus", ...).
Whenever possible, stick to readCorpus.
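As a minimal sketch of the two calls mentioned above, the following creates an empty object each way and inspects it with generic tools from the methods package. It assumes that tm.plugin.koRpus is installed and that both calls succeed without arguments; the variable names are only illustrative. Nothing is run if the package is unavailable.

```r
library(methods)

# only run when tm.plugin.koRpus is actually installed
if (requireNamespace("tm.plugin.koRpus", quietly = TRUE)) {
  suppressPackageStartupMessages(library(tm.plugin.koRpus))

  # constructor function provided by the package
  viaConstructor <- kRp.corpus()
  # generic S4 instantiation (assumed here to work without arguments)
  viaNew <- new("kRp.corpus")

  # both objects should report the same class
  is(viaConstructor, "kRp.corpus")
  is(viaNew, "kRp.corpus")

  # list the slots that the class defines
  slotNames(viaConstructor)
}
```

Objects created this way are empty shells; for actual corpora, readCorpus remains the recommended entry point.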
Note
There are also getter and setter methods for objects of this class.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=list(
      Topic=c(
        Winner="Reality Winner",
        Edwards="Natalie Edwards"
      ),
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    # use tokenize() so examples run without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )
} else {}
# manual creation
emptyCorpus <- kRp.corpus()