corpus_files {tm.plugin.koRpus}    R Documentation

Get a comprehensive data frame describing the files of your corpus

Description

The function translates the given hierarchy definition into a data frame with one row for each file, including the generated document ID.

Usage

corpus_files(
  dir,
  hierarchy = list(),
  fsep = .Platform$file.sep,
  full_list = FALSE
)

Arguments

dir

File path to the root directory of the text corpus, or a TIF[1]-compliant data frame.

hierarchy

A named list of named character vectors describing the directory hierarchy level by level. If TRUE instead, the hierarchy structure is taken directly from the directory tree. See section Hierarchy of readCorpus for details.

fsep

Character string defining the path separator to use.

full_list

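Logical. If TRUE, a list with additional data frames describing the hierarchy is returned instead of a single data frame. See section Value.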

Value

By default, a data frame with columns doc_id, file, and path, plus one factor column for each hierarchy level. If full_list=TRUE, a list containing that data frame (all_files) together with data frames describing the hierarchy by given names (hier_names), directories (hier_dirs), and relative paths (hier_paths).
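
To illustrate how a hierarchy definition relates to given names, directories, and relative paths, here is a minimal base R sketch. It is not the package's implementation. It assumes that the names of each character vector are the directory names and the values are the given names, and the variable names only echo the list elements described above; the actual layout of those data frames may differ.

```r
# Hypothetical hierarchy definition with two levels
hierarchy <- list(
  Topic = c(Winner = "Reality Winner", Edwards = "Natalie Edwards"),
  Source = c(Wikipedia_prev = "Wikipedia (old)", Wikipedia_new = "Wikipedia (new)")
)
fsep <- .Platform$file.sep

# One row per combination of directory names across all levels
# (assumption: vector names are the directory names)
hier_dirs <- expand.grid(lapply(hierarchy, names), stringsAsFactors = FALSE)

# Relative path for each combination, joined with the path separator
rel_path <- do.call(paste, c(unname(as.list(hier_dirs)), sep = fsep))

# Look up the given name for each directory, level by level
# (assumption: vector values are the given names)
hier_names <- as.data.frame(
  Map(function(level, dirs) unname(level[dirs]), hierarchy, hier_dirs),
  stringsAsFactors = FALSE
)

print(cbind(hier_dirs, rel_path = rel_path))
print(hier_names)
```

Two levels with two entries each yield four directory combinations; each file found below one of these paths would then get its own row in the resulting data frame.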

References

[1] Text Interchange Formats (https://github.com/ropensci/tif)

Examples

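# describe the files of the sample corpus shipped with the package,
# using two hierarchy levels (Topic and Source)
myCorpusFiles <- corpus_files(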
  dir=file.path(
    path.package("tm.plugin.koRpus"), "examples", "corpus"
  ),
  hierarchy=list(
    Topic=c(
      Winner="Reality Winner",
      Edwards="Natalie Edwards"
    ),
    Source=c(
      Wikipedia_prev="Wikipedia (old)",
      Wikipedia_new="Wikipedia (new)"
    )
  )
)
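
A variant of the call above with full_list=TRUE might look like the following sketch. It assumes that the package is installed and that the sample corpus is available under examples/corpus in the installed package; the object names corpus_dir and myCorpusList are made up for illustration. The list element names are those given in section Value.

```r
# Only run if the package is available (assumption: it is installed)
if (requireNamespace("tm.plugin.koRpus", quietly = TRUE)) {
  # Locate the sample corpus without attaching the package
  corpus_dir <- system.file("examples", "corpus", package = "tm.plugin.koRpus")

  myCorpusList <- tm.plugin.koRpus::corpus_files(
    dir = corpus_dir,
    hierarchy = list(
      Topic = c(
        Winner = "Reality Winner",
        Edwards = "Natalie Edwards"
      ),
      Source = c(
        Wikipedia_prev = "Wikipedia (old)",
        Wikipedia_new = "Wikipedia (new)"
      )
    ),
    full_list = TRUE
  )

  # Inspect which elements were returned
  print(names(myCorpusList))

  # The per-file data frame is stored in the all_files element
  print(head(myCorpusList[["all_files"]]))
}
```

Using system.file() instead of path.package() avoids having to attach the package first.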

[Package tm.plugin.koRpus version 0.4-2 Index]