R: Create kRp.corpus objects from text files or data frames

readCorpus {tm.plugin.koRpus}

R Documentation

Create kRp.corpus objects from text files or data frames

Description

You can either read a corpus from text files (one file per text, also see the Hierarchy section below) or from TIF compliant data frames (see the Data frames section below).

Usage

readCorpus(
  dir,
  hierarchy = list(),
  lang = "kRp.env",
  tagger = "kRp.env",
  encoding = "",
  pattern = NULL,
  recursive = FALSE,
  ignore.case = FALSE,
  mode = "text",
  format = "file",
  mc.cores = getOption("mc.cores", 1L),
  id = "",
  ...
)

Arguments

`dir`	Either a file path to the root directory of the text corpus, or a TIF compliant data frame. If a directory path (character string), texts can be recursively ordered into subfolders named exactly as defined by `hierarchy`. If `hierarchy` is an empty list, all text files located in `dir` are parsed without a hierarchical structure. If a data frame, also set `format="obj"` and provide hierarchy levels as additional columns, as described in the Data frames section.
`hierarchy`	A named list of named character vectors describing the directory hierarchy level by level. If `TRUE` instead, the hierarchy structure is taken directly from the directory tree. See section Hierarchy for details.
`lang`	A character string naming the language of the analyzed corpus. See `kRp.POS.tags` for all supported languages. If set to `"kRp.env"`, the language is taken from `get.kRp.env`. This information is also passed to the `readerControl` list of the `VCorpus` call.
`tagger`	A character string pointing to the tokenizer/tagger command you want to use for basic text analysis. Defaults to `tagger="kRp.env"`, which takes the settings from `get.kRp.env`. Set to `"tokenize"` to use `tokenize`.
`encoding`	Character string describing the current encoding. See `DirSource` for details, omitted if `format="obj"`.
`pattern`	A regular expression for file matching. See `DirSource` for details, omitted if `format="obj"`.
`recursive`	Logical, indicates whether directories should be read recursively. See `DirSource` for details, omitted if `format="obj"`.
`ignore.case`	Logical, indicates whether `pattern` is matched case-insensitively. See `DirSource` for details, omitted if `format="obj"`.
`mode`	Character string defining the reading mode. See `DirSource` for details, omitted if `format="obj"`.
`format`	Either "file" or "obj", depending on whether you want to scan files or analyze the text in a given object, like a character vector. If the latter and `treetag` is used as the `tagger`, texts will be written to temporary files for the process (see `dir`).
`mc.cores`	The number of cores to use for parallelization, see `mclapply`. This value is passed through to simpleCorpus.
`id`	A character string describing the main subject/purpose of the text corpus.
`...`	Additional options which are passed through to the defined `tagger`.

Value

An object of class kRp.corpus.

Hierarchy

To import a hierarchically structured text corpus, you must arrange all texts in a directory structure that mirrors the hierarchy. For example, if you would like to import a corpus on two different topics from two different sources, your hierarchy has two nested levels (topic and source). The root directory dir would then need two subdirectories (one for each topic), each of which must in turn have two subdirectories (one for each source), and the actual text files are found in those.
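The layout described above can be sketched in base R. This is only an illustration built in a temporary directory: the topic and source names are borrowed from the Examples section, while the root folder name `corpus_sketch` and the file name `text01.txt` are made up for this sketch.

```r
# Build a throwaway two-level corpus layout: <root>/<topic>/<source>/<file>
# "corpus_sketch" and "text01.txt" are hypothetical names used for illustration.
root <- file.path(tempdir(), "corpus_sketch")
topics <- c("Winner", "Edwards")
sources <- c("Wikipedia_prev", "Wikipedia_new")

for (topic in topics) {
  for (src in sources) {
    leaf <- file.path(root, topic, src)
    dir.create(leaf, recursive = TRUE, showWarnings = FALSE)
    writeLines("Placeholder text.", file.path(leaf, "text01.txt"))
  }
}

# List all text files relative to the root directory
corpus_files <- list.files(root, recursive = TRUE)
print(corpus_files)
```

A directory built this way would be passed as dir, with the two levels described by the hierarchy argument.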

To use this hierarchical structure in readCorpus, the hierarchy argument is used. It is a named list, where each list item represents one hierarchical level (here again topic and source), and its value is a named character vector describing the actual topics and sources to be used. It is important to understand how these character vectors are treated: The names of elements must exactly match the corresponding subdirectory name, whereas the value is a free text description. The names of the list items, however, describe the hierarchical level and are not matched with directory names.
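As a rough illustration of the three roles involved (level labels, directory names, and descriptions), the following base R sketch takes apart a hierarchy list like the one in the Examples section. The variable names `level_names`, `descriptions`, and `leaf_dirs` are introduced here for illustration only, and the derivation of expected subdirectories is an assumption about what the naming rule implies, not the package's internal code.

```r
hierarchy <- list(
  Topic = c(Winner = "Reality Winner", Edwards = "Natalie Edwards"),
  Source = c(Wikipedia_prev = "Wikipedia (old)", Wikipedia_new = "Wikipedia (new)")
)

# Names of the list items: labels of the hierarchical levels
# (not matched with directory names)
level_names <- names(hierarchy)

# Values of the character vectors: free text descriptions
descriptions <- lapply(hierarchy, unname)

# Names of the vector elements: these must equal the subdirectory names,
# so the relative leaf directories implied by this hierarchy are:
leaf_dirs <- as.vector(
  outer(names(hierarchy$Topic), names(hierarchy$Source), file.path)
)
print(leaf_dirs)
```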

Data frames

In order to import a corpus from a data frame, the object must be in Text Interchange Format (TIF) as described by [1]. As a minimum, the data frame must have two character columns, doc_id and text.

You can provide additional information on hierarchical categories by using further columns, where the column name must match the category name (hierarchical level). The order of those columns in the data frame is not important, as you must still fully define the hierarchical structure using the hierarchy argument. Columns that you omit from hierarchy are ignored. The values used in the hierarchy list and in the respective columns must match, because rows with unmatched category levels are also ignored.

Note that the special column names path and file will also be imported automatically.
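A minimal sketch of such a data frame, using base R only. The document IDs and texts are invented, `myCorpus_df` reuses the placeholder name from the Examples section, and the `matched` check merely mimics the matching rule described above; it is not the package's own implementation.

```r
# TIF style data frame: doc_id and text are required character columns,
# Topic and Source are optional hierarchy columns (all values made up)
myCorpus_df <- data.frame(
  doc_id = c("doc1", "doc2", "doc3"),
  text = c("First text.", "Second text.", "Third text."),
  Topic = c("Winner", "Edwards", "Other"),
  Source = c("Wikipedia_prev", "Wikipedia_new", "Wikipedia_new"),
  stringsAsFactors = FALSE
)

hierarchy <- list(
  Topic = c(Winner = "Reality Winner", Edwards = "Natalie Edwards"),
  Source = c(Wikipedia_prev = "Wikipedia (old)", Wikipedia_new = "Wikipedia (new)")
)

# A row matches if, for every level, its value is one of the element names
matched <- Reduce(
  `&`,
  lapply(names(hierarchy), function(level) {
    myCorpus_df[[level]] %in% names(hierarchy[[level]])
  })
)

# Rows that fail this check (here: the one with Topic "Other")
# would be expected to be ignored on import
print(myCorpus_df[matched, "doc_id"])
```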

References

[1] Text Interchange Formats (https://github.com/ropensci/tif)

Examples

# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  # "flat" corpus, parse all texts in the given dir
  myCorpus <- readCorpus(
    dir=file.path(
      path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner", "Wikipedia_prev"
    ),
    # use tokenize() so examples run without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )
 
  # corpus with one category named "Source"
  myCorpus <- readCorpus(
    dir=file.path(
      path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner"
    ),
    hierarchy=list(
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    tagger="tokenize",
    lang="en"
  )
 
  # two hierarchical levels, "Topic" and "Source"
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=list(
      Topic=c(
        Winner="Reality Winner",
        Edwards="Natalie Edwards"
      ),
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    tagger="tokenize",
    lang="en"
  )
 
  # get hierarchy from directory tree
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=TRUE,
    tagger="tokenize",
    lang="en"
  )
  
  ## Not run: 
    # if the same corpus is available as TIF compliant data frame
    myCorpus <- readCorpus(
      dir=myCorpus_df,
      hierarchy=list(
        Topic=c(
          Winner="Reality Winner",
          Edwards="Natalie Edwards"
        ),
        Source=c(
          Wikipedia_prev="Wikipedia (old)",
          Wikipedia_new="Wikipedia (new)"
        )
      ),
      lang="en",
      format="obj"
    )
  
## End(Not run)
} else {}

[Package tm.plugin.koRpus version 0.4-2 Index]