readCorpus {tm.plugin.koRpus} | R Documentation |
Create kRp.corpus objects from text files or data frames
Description
You can read a corpus either from text files (one file per text, also see the Hierarchy section below) or from TIF compliant data frames (see the Data frames section below).
Usage
readCorpus(
dir,
hierarchy = list(),
lang = "kRp.env",
tagger = "kRp.env",
encoding = "",
pattern = NULL,
recursive = FALSE,
ignore.case = FALSE,
mode = "text",
format = "file",
mc.cores = getOption("mc.cores", 1L),
id = "",
...
)
Arguments
dir |
Either a file path to the root directory of the text corpus,
or a TIF compliant data frame.
If a directory path (character string),
texts can be recursively ordered into subfolders named
exactly as defined by |
hierarchy |
A named list of named character vectors describing the directory hierarchy level by level.
If |
lang |
A character string naming the language of the analyzed corpus.
See |
tagger |
A character string pointing to the tokenizer/tagger command you want to use for basic text analysis.
Defaults to |
encoding |
Character string describing the current encoding.
See |
pattern |
A regular expression for file matching.
See |
recursive |
Logical, indicates whether directories should be read recursively.
See |
ignore.case |
Logical, indicates whether |
mode |
Character string defining the reading mode.
See |
format |
Either "file" or "obj",
depending on whether you want to scan files or analyze the text in a given object,
like a character vector. If the latter and |
mc.cores |
The number of cores to use for parallelization,
see |
id |
A character string describing the main subject/purpose of the text corpus. |
... |
Additional options which are passed through to the defined |
Value
An object of class kRp.corpus.
Hierarchy
To import a hierarchically structured text corpus you must categorize all texts in a directory
structure that resembles the hierarchy. If, for example, you would like to import a corpus on two
different topics and two different sources, your hierarchy has two nested levels (topic and source).
The root directory dir would then need to have two subdirectories (one for each topic),
which in turn must each have two subdirectories (one for each source), and the actual text files
are found in those.
To use this hierarchical structure in readCorpus, the hierarchy argument is used.
It is a named list, where each list item represents one hierarchical level (here again topic and source),
and its value is a named character vector describing the actual topics and sources to be used. It is
important to understand how these character vectors are treated: The names of elements must exactly match
the corresponding subdirectory name, whereas the value is a free text description. The names of the
list items, however, describe the hierarchical level and are not matched with directory names.
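The directory layout described above can be sketched in base R. This is only an illustration under assumptions: the root directory name corpus_sketch is hypothetical, the topic and source names are borrowed from the Examples section, and the tree is created empty in a temporary directory rather than taken from a real corpus.

```r
# hierarchy as described above: list item names are the levels,
# vector names are the subdirectory names, vector values are free text labels
hierarchy <- list(
  Topic=c(
    Winner="Reality Winner",
    Edwards="Natalie Edwards"
  ),
  Source=c(
    Wikipedia_prev="Wikipedia (old)",
    Wikipedia_new="Wikipedia (new)"
  )
)

# every combination of topic and source directory names (2 x 2 = 4 leaves)
combos <- expand.grid(
  Topic=names(hierarchy[["Topic"]]),
  Source=names(hierarchy[["Source"]]),
  stringsAsFactors=FALSE
)

# hypothetical root directory, created below the session's temporary directory
corpus_root <- file.path(tempdir(), "corpus_sketch")
leaf_paths <- file.path(corpus_root, combos[["Topic"]], combos[["Source"]])

# create <root>/<topic>/<source> for each combination
for (leaf in leaf_paths) {
  dir.create(leaf, recursive=TRUE, showWarnings=FALSE)
}

# show the resulting tree relative to the root
print(list.dirs(corpus_root, full.names=FALSE, recursive=TRUE))
```

Pointing dir at corpus_root and passing the same hierarchy list to readCorpus() would then be expected to pick up text files placed in those four leaf directories.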
Data frames
In order to import a corpus from a data frame, the object must be in Text Interchange Format (TIF)
as described by [1]. As a minimum, the data frame must have two character columns, doc_id and text.
You can provide additional information on hierarchical categories by using further columns,
where the column name must match the category name (hierarchical level). The order of those
columns in the data frame is not important, as you must still fully define the hierarchical structure
using the hierarchy argument. All columns you omit from the hierarchy list are ignored,
but the values used in the hierarchy list and the respective columns must match,
as rows with unmatched category levels will also be ignored.
Note that the special column names path and file will also be imported automatically.
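A minimal sketch of a data frame with this shape, built with base R only. The document IDs and texts are made-up placeholders, and the check at the end assumes that the values in the Topic and Source columns correspond to the names (not the descriptions) of the hierarchy vectors, analogous to how directory names are matched. That reading is an assumption, not something confirmed here.

```r
# placeholder data: the IDs and texts are invented for illustration only
myCorpus_df <- data.frame(
  doc_id=c("doc1", "doc2", "doc3"),
  text=c("First placeholder text.", "Second placeholder text.", "Third placeholder text."),
  Topic=c("Winner", "Winner", "Other"),
  Source=c("Wikipedia_prev", "Wikipedia_new", "Wikipedia_prev"),
  stringsAsFactors=FALSE
)

hierarchy <- list(
  Topic=c(
    Winner="Reality Winner",
    Edwards="Natalie Edwards"
  ),
  Source=c(
    Wikipedia_prev="Wikipedia (old)",
    Wikipedia_new="Wikipedia (new)"
  )
)

# minimum requirement: character columns doc_id and text
has_required <- all(c("doc_id", "text") %in% names(myCorpus_df)) &&
  is.character(myCorpus_df[["doc_id"]]) &&
  is.character(myCorpus_df[["text"]])

# ASSUMPTION: column values are compared with the names of each hierarchy vector
level_ok <- lapply(
  names(hierarchy),
  function(lvl) myCorpus_df[[lvl]] %in% names(hierarchy[[lvl]])
)

# a row counts as matched only if all of its category levels are defined
matched <- Reduce(`&`, level_ok)

# rows that would be expected to be ignored on import
print(myCorpus_df[["doc_id"]][!matched])
```

Under the stated assumption, doc3 is the row that would be dropped, because "Other" is not a topic defined in the hierarchy list.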
References
[1] Text Interchange Formats (https://github.com/ropensci/tif)
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  # "flat" corpus, parse all texts in the given dir
  myCorpus <- readCorpus(
    dir=file.path(
      path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner", "Wikipedia_prev"
    ),
    # use tokenize() so examples run without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )

  # corpus with one category named "Source"
  myCorpus <- readCorpus(
    dir=file.path(
      path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner"
    ),
    hierarchy=list(
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    tagger="tokenize",
    lang="en"
  )

  # two hierarchical levels, "Topic" and "Source"
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=list(
      Topic=c(
        Winner="Reality Winner",
        Edwards="Natalie Edwards"
      ),
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    tagger="tokenize",
    lang="en"
  )

  # get hierarchy from directory tree
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=TRUE,
    tagger="tokenize",
    lang="en"
  )

  ## Not run:
  # if the same corpus is available as TIF compliant data frame
  myCorpus <- readCorpus(
    dir=myCorpus_df,
    hierarchy=list(
      Topic=c(
        Winner="Reality Winner",
        Edwards="Natalie Edwards"
      ),
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    lang="en",
    format="obj"
  )
  ## End(Not run)
} else {}