R: Manage Corpus Data and Encode CWB Corpus.

CorpusData {cwbtools}

R Documentation

Manage Corpus Data and Encode CWB Corpus.

Description

Manage Corpus Data and Encode CWB Corpus.

Details

See the CWB Encoding Tutorial on characters allowed for encoding attributes: "By convention, all attribute names must be lowercase (more precisely, they may only contain the characters a-z, 0-9, -, and _, and may not start with a digit). Therefore, the names of XML elements to be included in the CWB corpus must not contain any non-ASCII or uppercase letters." (section 2)

The method $import_xml() imports XML files.
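The naming convention quoted above can be checked before encoding. The following is a minimal sketch in base R; the helper `is_valid_attribute_name()` is hypothetical and not part of cwbtools, and the regular expression is a literal reading of the quoted rule (characters a-z, 0-9, - and _ only, no leading digit).

```r
# Hypothetical helper (not part of cwbtools): test names against the quoted convention.
is_valid_attribute_name <- function(x) {
  # First character: a-z, "-" or "_" (no digit);
  # remaining characters: a-z, 0-9, "-" or "_".
  grepl("^[a-z_-][a-z0-9_-]*$", x, perl = TRUE)
}

# Illustrative candidate names; "ann\u00e9e" contains a non-ASCII letter.
candidates <- c("speaker", "party_name", "Date", "2nd_reading", "ann\u00e9e")
valid <- is_valid_attribute_name(candidates)
names(valid) <- candidates
print(valid)
```

Names that fail the check (uppercase letters, a leading digit, non-ASCII letters) would need to be renamed before they are used as attribute names.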

Public fields

chunktable: A data.table with column "id" (unique values), columns with metadata, and a column with text chunks.
tokenstream: A data.table with a column "cpos" (corpus position), and columns with positional attributes, such as "word", "lemma", "pos", "stem".
metadata: A data.table with a column "id" that links the data to chunks/tokenstream, columns with document-level metadata, and the columns "cpos_left" and "cpos_right", which can be generated using the method $add_corpus_positions().
sentences: A data.table.
named_entities: A data.table.

Methods

Method `new()`

Initialize a new instance of class CorpusData.

Usage

CorpusData$new()

Returns

A class CorpusData object.

Method `print()`

Print summary of CorpusData object.

Usage

CorpusData$print()

Method `tokenize()`

Simple tokenization of text in chunktable.

Usage

CorpusData$tokenize(..., verbose = TRUE, progress = TRUE)

Arguments

...: Arguments that are passed into tokenizers::tokenize_words().
verbose: A logical value, whether to be verbose.
progress: A logical value, whether to show progress bar.
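As a rough illustration of what can be passed through `...`, the sketch below calls tokenizers::tokenize_words() directly on a small character vector. The input strings are made up for illustration. Assuming the arguments are forwarded unchanged, the same settings could be given as `CD$tokenize(lowercase = FALSE, strip_punct = FALSE)`.

```r
library(tokenizers)

# Made-up text chunks standing in for the text column of a chunktable.
chunks <- c("Hello, World!", "A second chunk of text.")

# 'lowercase' and 'strip_punct' are arguments of tokenizers::tokenize_words();
# here case is preserved and punctuation is kept.
tokens <- tokenize_words(chunks, lowercase = FALSE, strip_punct = FALSE)

# One character vector of tokens per input chunk.
str(tokens)
```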

Method `import_xml()`

Usage

CorpusData$import_xml(
  filenames,
  body = "//body",
  meta = NULL,
  mc = NULL,
  progress = TRUE
)

Arguments

filenames: A vector of files to process.
body: An xpath expression defining the body of the XML document.
meta: A named character vector with XPath expressions.
mc: A numeric/integer value, number of cores to use.
progress: A logical value, whether to show progress bar.

Returns

The CorpusData object is returned invisibly.

Method `add_corpus_positions()`

Add column 'cpos' to tokenstream and columns 'cpos_left' and 'cpos_right' to metadata.

Usage

CorpusData$add_corpus_positions(verbose = TRUE)

Arguments

verbose: A logical value, whether to be verbose.
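The relation between the three columns can be sketched with data.table. This is an illustration under the assumption that corpus positions are zero-based and consecutive (as in the example at the end of this page) and that regions are derived per "id"; it is not the package's implementation.

```r
library(data.table)

# Toy tokenstream: two documents (ids 1 and 2) with three and two tokens.
tokenstream <- data.table(
  id = c(1L, 1L, 1L, 2L, 2L),
  word = c("A", "first", "text", "Second", "text")
)

# Corpus positions are zero-based and run over the whole tokenstream.
tokenstream[, cpos := seq_len(.N) - 1L]

# For each id, the region spans from its first to its last corpus position.
regions <- tokenstream[
  , .(cpos_left = min(cpos), cpos_right = max(cpos)),
  by = id
]
print(regions)
```

In this toy case, the first document covers corpus positions 0 to 2 and the second covers 3 to 4.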

Method `purge()`

Remove patterns from the chunktable that are known to cause problems. This is done most efficiently at the chunktable stage of data preparation, because the character vector to process is much shorter than after tokenization/annotation has been performed.

Usage

CorpusData$purge(
  replacements = list(c("^\\s*<.*?>\\s*$", ""), c("’", "'"))
)

Arguments

replacements: A list of length-two character vectors with regular expressions and replacements.
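To show the structure of the `replacements` argument, the sketch below applies such a list to a character vector with gsub(). The helper `apply_replacements()` and the sample chunks are hypothetical; how the method applies the patterns internally is not documented here, so this is only one plausible reading.

```r
# Hypothetical helper, not the package's implementation: apply each
# (pattern, replacement) pair in turn to a character vector.
apply_replacements <- function(x, replacements) {
  for (r in replacements) {
    x <- gsub(pattern = r[1], replacement = r[2], x = x, perl = TRUE)
  }
  x
}

# Same pairs as the method's default; "\u2019" is the right single quotation mark.
replacements <- list(
  c("^\\s*<.*?>\\s*$", ""),
  c("\u2019", "'")
)

# Made-up chunks: one consisting only of a tag, one with a curly apostrophe.
chunks <- c("<pb n=\"12\"/>", "It\u2019s a test.", "Plain text")
cleaned <- apply_replacements(chunks, replacements)
print(cleaned)
```

The first pattern only matches chunks that consist entirely of a single tag-like string, so tags embedded in running text are left alone.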

Method `encode()`

Encode corpus. If the corpus already exists, it will be removed.

Usage

CorpusData$encode(
  corpus,
  p_attributes = "word",
  s_attributes = NULL,
  encoding,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  data_dir = NULL,
  method = c("R", "CWB"),
  verbose = TRUE,
  compress = FALSE,
  reload = TRUE,
  quietly = TRUE
)

Arguments

corpus: The name of the CWB corpus.
p_attributes: Positional attributes.
s_attributes: Columns that will be encoded as structural attributes.
encoding: Encoding/charset of the CWB corpus.
registry_dir: Corpus registry, the directory where registry files are stored.
data_dir: Directory in which the directory for the indexed corpus files is created.
method: Either "R" or "CWB".
verbose: A logical value, whether to be verbose.
compress: A logical value, whether to compress corpus.
reload: A logical value, whether to reload corpus.
quietly: A logical value passed into RcppCWB::cwb_makeall(), RcppCWB::cwb_huffcode() and RcppCWB::cwb_compress_rdx() to control the verbosity of these functions.

Method `clone()`

The objects of this class are cloneable with this method.

Usage

CorpusData$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples

library(RcppCWB)
library(data.table)

# This example relies on the "R" method to write data to disk. There is also a method "CWB"
# that relies on the CWB tools to generate the indexed corpus. The CWB can be downloaded
# and installed within the package by calling cwb_install().

# create temporary registry file so that data in RcppCWB package can be used

registry_rcppcwb <- system.file(package = "RcppCWB", "extdata", "cwb", "registry")
registry_tmp <- fs::path(tempdir(), "registry")
if (!dir.exists(registry_tmp)) dir.create(registry_tmp)
r <- registry_file_parse("REUTERS", registry_dir = registry_rcppcwb)
r[["home"]] <- system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters")
registry_file_write(r, corpus = "REUTERS", registry_dir = registry_tmp)

# decode structural attribute 'places'

s_attrs_places <- RcppCWB::s_attribute_decode(
  corpus = "REUTERS",
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"),
  s_attribute = "places", method = "R"
)
s_attrs_places[["id"]] <- 1L:nrow(s_attrs_places)
setnames(s_attrs_places, old = "value", new = "places")

# decode positional attribute 'word'

tokens <- apply(s_attrs_places, 1, function(row){
  ids <- cl_cpos2id(
    corpus = "REUTERS", cpos = row[1]:row[2],
    p_attribute = "word", registry = registry_tmp
  )
  cl_id2str(corpus = "REUTERS", id = ids, p_attribute = "word", registry = registry_tmp)
})
tokenstream <- rbindlist(
  lapply(
    seq_along(tokens),
    function(i) data.table(id = i, word = tokens[[i]])
  )
)
tokenstream[["cpos"]] <- 0L:(nrow(tokenstream) - 1L)

# create CorpusData object (see vignette for further explanation)

CD <- CorpusData$new()
CD$tokenstream <- as.data.table(tokenstream)
CD$metadata <- as.data.table(s_attrs_places)

# Remove temporary registry with home dir still pointing to RcppCWB data dir
# to prevent data from being deleted
file.remove(fs::path(registry_tmp, "reuters"))
file.remove(registry_tmp)

# create temporary directories (registry directory and one for indexed corpora)

registry_tmp <- fs::path(tempdir(), "registry")
data_dir_tmp <- fs::path(tempdir(), "data_dir")
if (!dir.exists(registry_tmp)) dir.create(registry_tmp)
if (!dir.exists(data_dir_tmp)) dir.create(data_dir_tmp)

CD$encode(
  corpus = "REUTERS", encoding = "utf8",
  p_attributes = "word", s_attributes = "places",
  registry_dir = registry_tmp, data_dir = data_dir_tmp,
  method = "R"
)
reg <- registry_data(name = "REUTERS", id = "REUTERS", home = data_dir_tmp, p_attributes = "word")
registry_file_write(data = reg, corpus = "REUTERS", registry_dir = registry_tmp)

# see whether it works

cl_cpos2id(corpus = "REUTERS", p_attribute = "word", cpos = 0L:4049L, registry = registry_tmp)

[Package cwbtools version 0.4.2 Index]
