corpus_files {tm.plugin.koRpus} | R Documentation |
Get a comprehensive data frame describing the files of your corpus
Description
The function translates the given hierarchy definition into a data frame with one row for each file, including the generated document ID.
Usage
corpus_files(
dir,
hierarchy = list(),
fsep = .Platform$file.sep,
full_list = FALSE
)
Arguments
dir |
File path to the root directory of the text corpus, or a TIF[1]-compliant data frame. |
hierarchy |
A named list of named character vectors describing the directory hierarchy level by level.
If |
fsep |
Character string defining the path separator to use. |
full_list |
Logical, see return value. |
Value
Either a data frame with columns doc_id, file, path and one further factor column for each hierarchy level, or (if full_list=TRUE) a list containing that data frame (all_files) and also data frames describing the hierarchy by given names (hier_names), directories (hier_dirs) and relative paths (hier_paths).
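As a rough sketch of how the list returned with full_list=TRUE might be inspected, the following assumes that the package is installed and that its example corpus is found under examples/corpus, as in the Examples section. The variable names corpus_dir and corpus_info are made up for illustration, and the block skips the call when the package is not available.

```r
# Sketch only: the call is skipped when tm.plugin.koRpus is not installed.
corpus_info <- NULL
if (requireNamespace("tm.plugin.koRpus", quietly = TRUE)) {
  # Assumption: the example corpus ships with the installed package.
  corpus_dir <- system.file("examples", "corpus", package = "tm.plugin.koRpus")
  corpus_info <- tm.plugin.koRpus::corpus_files(
    dir = corpus_dir,
    hierarchy = list(
      Topic = c(Winner = "Reality Winner", Edwards = "Natalie Edwards"),
      Source = c(Wikipedia_prev = "Wikipedia (old)", Wikipedia_new = "Wikipedia (new)")
    ),
    full_list = TRUE
  )
  # Element names as listed in the Value section
  print(names(corpus_info))
  # The per-file data frame (doc_id, file, path, plus one column per level)
  str(corpus_info[["all_files"]])
}
```

With the default full_list=FALSE, only the per-file data frame would be returned, so the [["all_files"]] step would not be needed.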
References
[1] Text Interchange Formats (https://github.com/ropensci/tif)
Examples
myCorpusFiles <- corpus_files(
  dir=file.path(
    path.package("tm.plugin.koRpus"), "examples", "corpus"
  ),
  hierarchy=list(
    Topic=c(
      Winner="Reality Winner",
      Edwards="Natalie Edwards"
    ),
    Source=c(
      Wikipedia_prev="Wikipedia (old)",
      Wikipedia_new="Wikipedia (new)"
    )
  )
)
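To make the idea of a level-by-level hierarchy more concrete, here is a base R sketch of how such a definition could expand into one relative path per combination of levels. It is not the package's implementation. It assumes that the names of each character vector are directory names and the values are human-readable labels, and the object names hier_dirs, rel_paths and layout are chosen here for illustration only.

```r
# Illustration only, using base R; not how corpus_files() works internally.
hierarchy <- list(
  Topic = c(Winner = "Reality Winner", Edwards = "Natalie Edwards"),
  Source = c(Wikipedia_prev = "Wikipedia (old)", Wikipedia_new = "Wikipedia (new)")
)

# Assumption: vector names are directory names, one hierarchy level per list entry.
hier_dirs <- expand.grid(
  lapply(hierarchy, names),
  KEEP.OUT.ATTRS = FALSE,
  stringsAsFactors = FALSE
)

# One relative path per combination of levels, e.g. <Topic>/<Source>
rel_paths <- do.call(
  file.path,
  c(unname(as.list(hier_dirs)), list(fsep = "/"))
)

# Pair each path with the human-readable labels of its levels.
layout <- data.frame(
  Topic = unname(hierarchy$Topic[hier_dirs$Topic]),
  Source = unname(hierarchy$Source[hier_dirs$Source]),
  path = rel_paths,
  stringsAsFactors = FALSE
)
print(layout)
```

Under that assumption, two levels with two entries each give four directories in which files would be looked up.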