R: Read, process and write data on structural attributes.

s_attribute_encode {cwbtools}

R Documentation

Read, process and write data on structural attributes.

Description

Read, process and write data on structural attributes.

Usage

s_attribute_encode(
  values,
  data_dir,
  s_attribute,
  corpus,
  region_matrix,
  method = c("R", "CWB"),
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  encoding,
  delete = FALSE,
  verbose = TRUE
)

s_attribute_recode(
  data_dir,
  s_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)

s_attribute_files(s_attribute, data_dir)

s_attribute_get_values(s_attribute, data_dir)

s_attribute_get_regions(s_attribute, data_dir)

s_attribute_merge(x, y)

s_attribute_delete(corpus, s_attribute)

s_attribute_rename(corpus, old, new, registry_dir, verbose = TRUE)

Arguments

`values`	A `character` vector with the values of the structural attribute.
`data_dir`	The data directory to which the files are written (or from which they are read).
`s_attribute`	Name of the structural attribute, an atomic `character` vector that may contain only lowercase ASCII letters (a-z), digits (0-9), hyphens (-) and underscores (_). Non-ASCII characters and uppercase letters are not allowed.
`corpus`	A CWB corpus.
`region_matrix`	A two-column `matrix` with corpus positions.
`method`	Either 'R' or 'CWB'.
`registry_dir`	Path name of the registry directory.
`encoding`	Encoding of the data.
`delete`	Logical, whether to call `RcppCWB::cl_delete_corpus()`.
`verbose`	Logical.
`from`	Character string describing the current encoding of the attribute.
`to`	Character string describing the target encoding of the attribute.
`x`	Data defining a first s-attribute, a `data.table` (or an object coercible to a `data.table`) with three columns ("cpos_left", "cpos_right", "value").
`y`	Data defining a second s-attribute, a `data.table` (or an object coercible to a `data.table`) with three columns ("cpos_left", "cpos_right", "value").
`old`	A `character` vector with s-attributes to be renamed.
`new`	A `character` vector with new names of s-attributes. The vector needs to have the same length as vector `old`. The 1st, 2nd, 3rd ... nth attribute stated in vector `old` will get the new names at the 1st, 2nd, 3rd, ... nth position of vector `new`.
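
The naming rule for `s_attribute` can be checked before encoding. The following is a minimal sketch in base R; the helper `is_valid_s_attribute_name()` is hypothetical and not part of cwbtools, and it assumes that a single name is passed.

```r
# Hypothetical helper (not part of cwbtools): checks the naming rule stated
# for the s_attribute argument. perl = TRUE makes the range a-z refer to
# ASCII code points, independent of the locale.
is_valid_s_attribute_name <- function(x) {
  is.character(x) && length(x) == 1L && !is.na(x) &&
    grepl("^[a-z0-9_-]+$", x, perl = TRUE)
}

is_valid_s_attribute_name("article_id") # lowercase, digits, "_" and "-" pass
is_valid_s_attribute_name("ArticleID")  # uppercase letters are rejected
is_valid_s_attribute_name("text date")  # whitespace is rejected
```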

Details

s_attribute_encode() offers a 'pure R' implementation (method = "R") for adding or modifying structural attributes of an existing CWB corpus.

If the corpus has been loaded or used before, a new s-attribute may not be available until RcppCWB::cl_delete_corpus() has been called. Set the argument delete to TRUE to have s_attribute_encode() call this function.

s_attribute_recode() recodes the values in the avs file and updates the attribute value index in the avx file. The rng file and the registry file remain unchanged. Because the registry file is not updated, it is highly recommended to treat s_attribute_recode() as a helper for corpus_recode(), which recodes all s-attributes and all p-attributes and resets the encoding in the registry file.
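
Why the avx file needs to change can be illustrated in base R. The sketch below assumes that the avx file stores byte offsets into the avs file and that each value in the avs file is terminated by a null byte. The helper `offsets()` and the sample values are made up for illustration.

```r
# Sample values, written with escapes so that the code itself is plain ASCII
values_utf8 <- c("Z\u00fcrich", "K\u00f6ln", "Bonn")
values_latin1 <- iconv(values_utf8, from = "UTF-8", to = "latin1")

# Hypothetical helper: byte offset at which each value would start if the
# values were stored back to back, each followed by one null byte
offsets <- function(x) {
  n_bytes <- nchar(x, type = "bytes") + 1L
  c(0L, cumsum(n_bytes))[seq_along(x)]
}

offsets(values_utf8)   # umlauts take two bytes each in UTF-8
offsets(values_latin1) # and one byte each in latin1, so later offsets shift
```

Any value that follows a non-ASCII string starts at a different byte position after recoding, which is why an index of offsets cannot be left as it is.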

s_attribute_files() will return a named character vector with the data files (extensions: "avs", "avx", "rng") in the directory indicated by data_dir for the structural attribute s_attribute.
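
As a rough illustration of the file naming, the paths can be assembled as follows. This is a sketch only: `s_attribute_files_sketch()` is a hypothetical stand-in, and naming the vector by file extension is an assumption about the return value.

```r
# Hypothetical stand-in for illustration, not the cwbtools implementation
s_attribute_files_sketch <- function(s_attribute, data_dir) {
  ext <- c("avs", "avx", "rng")
  files <- file.path(data_dir, paste(s_attribute, ext, sep = "."))
  names(files) <- ext # assumption: names are the file extensions
  files
}

s_attribute_files_sketch(s_attribute = "id", data_dir = tempdir())
```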

s_attribute_get_values() is equivalent to calling the CL function cl_struc2str for all strucs of a structural attribute. It is a "pure R" operation that is faster than using CL because it reads the data files of the s-attribute in one pass. The return value is a character vector with the string values of all strucs of the s-attribute.
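
The idea of reading values straight from the data files can be sketched with toy files. The layout used here is an assumption stated for illustration: the avs file holds the distinct values as null-terminated strings, and the avx file holds pairs of 4-byte big-endian integers (struc number, byte offset into the avs file). `read_values_sketch()` is hypothetical and not the cwbtools code.

```r
# Toy data: three strucs, two distinct values
values <- c("alpha", "beta", "alpha")
uniq <- unique(values)
toy_dir <- tempfile("toy_corpus")
dir.create(toy_dir)
avs_file <- file.path(toy_dir, "toy.avs")
avx_file <- file.path(toy_dir, "toy.avx")

# Write the distinct values as null-terminated strings (assumed avs layout)
con <- file(avs_file, open = "wb")
for (v in uniq) writeBin(c(charToRaw(v), as.raw(0)), con)
close(con)

# Write (struc, offset) pairs as big-endian integers (assumed avx layout)
offsets <- c(0L, cumsum(nchar(uniq, type = "bytes") + 1L))[seq_along(uniq)]
pairs <- rbind(seq_along(values) - 1L, offsets[match(values, uniq)])
con <- file(avx_file, open = "wb")
writeBin(as.integer(pairs), con, size = 4L, endian = "big")
close(con)

# Read both files in one go and look up the string for every struc
read_values_sketch <- function(avs_file, avx_file) {
  bytes <- readBin(avs_file, what = "raw", n = file.size(avs_file))
  ends <- which(bytes == as.raw(0))       # positions of terminating nulls
  starts <- c(1L, head(ends, -1L) + 1L)   # first byte of every string
  strings <- vapply(
    seq_along(starts),
    function(i) rawToChar(bytes[seq_len(ends[i] - starts[i]) + starts[i] - 1L]),
    character(1)
  )
  avx <- matrix(
    readBin(avx_file, what = integer(), n = file.size(avx_file) / 4,
            size = 4L, endian = "big"),
    ncol = 2, byrow = TRUE
  )
  strings[match(avx[, 2], starts - 1L)]   # offsets are zero-based
}

read_values_sketch(avs_file, avx_file)
```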

s_attribute_get_regions() returns a two-column integer matrix with the regions of the strucs of a given s-attribute. Left corpus positions are in the first column, right corpus positions in the second column. The result is equivalent to calling RcppCWB::get_region_matrix() for all strucs of an s-attribute. As a "pure R" function that reads the entire file directly, it may be somewhat faster.
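
Reading a whole file of regions at once might look like the sketch below. It assumes that the rng file is a flat sequence of 4-byte big-endian integers, alternating left and right corpus positions. The toy file and `read_regions_sketch()` are made up for illustration.

```r
# Toy regions: one row per struc, left and right corpus position
regions <- matrix(c(0L, 9L, 10L, 24L, 25L, 30L), ncol = 2, byrow = TRUE)

# Write them row by row as big-endian integers (assumed rng layout)
rng_file <- tempfile(fileext = ".rng")
con <- file(rng_file, open = "wb")
writeBin(as.integer(t(regions)), con, size = 4L, endian = "big")
close(con)

# Hypothetical reader: one readBin() call for the entire file
read_regions_sketch <- function(rng_file) {
  cpos <- readBin(rng_file, what = integer(), n = file.size(rng_file) / 4,
                  size = 4L, endian = "big")
  matrix(cpos, ncol = 2, byrow = TRUE)
}

read_regions_sketch(rng_file)
```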

s_attribute_merge() combines two tables with regions for s-attributes and checks for intersections that may cause problems. The heuristic is to keep all non-intersecting annotations as well as those annotations that define the same region in object x and object y. Annotations of x and y that overlap uncleanly, i.e. without identical left and right corpus positions ("cpos_left" / "cpos_right"), are dropped. The intended scenario is to decode an s-attribute (using s_attribute_decode()), mix in an additional annotation, and re-encode the enhanced s-attribute (using s_attribute_encode()).
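
The heuristic can be spelled out in base R as follows. This is a minimal sketch, not the cwbtools implementation: `merge_regions_sketch()` is hypothetical, it works on plain data frames, and keeping the value from x when both tables define the same region is an assumption.

```r
# Hypothetical re-statement of the heuristic described above
merge_regions_sketch <- function(x, y) {
  # TRUE for rows of a that overlap a row of b without being identical to it
  clashes <- function(a, b) {
    vapply(seq_len(nrow(a)), function(i) {
      overlap <- a$cpos_left[i] <= b$cpos_right & a$cpos_right[i] >= b$cpos_left
      same <- a$cpos_left[i] == b$cpos_left & a$cpos_right[i] == b$cpos_right
      any(overlap & !same)
    }, logical(1))
  }
  kept <- rbind(x[!clashes(x, y), ], y[!clashes(y, x), ])
  # Regions present in both tables: keep the first copy (the one from x)
  kept <- kept[!duplicated(kept[, c("cpos_left", "cpos_right")]), ]
  kept <- kept[order(kept$cpos_left), ]
  rownames(kept) <- NULL
  kept
}

x <- data.frame(
  cpos_left = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
merge_regions_sketch(x, y)
```

In this sketch the regions 10-12 / 11-12 and 20-21 / 20-22 are dropped because they overlap without matching exactly, while 5-5 and 25-27 survive because both tables agree on them.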

Function s_attribute_delete() is not yet implemented.

Function s_attribute_rename() can be used to rename a structural attribute.

Examples

library(RcppCWB)
library(cwbtools)

# Temporary registry and data directories for a working copy of the corpus
registry_tmp <- fs::path(tempdir(), "cwb", "registry")
data_dir_tmp <- fs::path(tempdir(), "cwb", "indexed_corpora", "reuters")

# Location of the REUTERS sample corpus shipped with RcppCWB
cwb_dir_rcppcwb <- system.file(package = "RcppCWB", "extdata", "cwb")
registry_dir_rcppcwb <- fs::path(cwb_dir_rcppcwb, "registry")
data_dir_rcppcwb <- fs::path(cwb_dir_rcppcwb, "indexed_corpora", "reuters")

corpus_copy(
  corpus = "REUTERS",
  registry_dir = registry_dir_rcppcwb,
  data_dir = data_dir_rcppcwb,
  registry_dir_new = registry_tmp,
  data_dir_new = data_dir_tmp
)

no_strucs <- cl_attribute_size(
  corpus = "REUTERS",
  attribute = "id",
  attribute_type = "s",
  registry = registry_tmp
)

cpos_matrix <- get_region_matrix(
  corpus = "REUTERS",
  struc = 0L:(no_strucs - 1L),
  s_attribute = "id",
  registry = registry_tmp
)

s_attribute_encode(
  values = as.character(seq_len(nrow(cpos_matrix))),
  data_dir = data_dir_tmp,
  s_attribute = "article_id",
  corpus = "REUTERS",
  region_matrix = cpos_matrix,
  method = "R",
  registry_dir = registry_tmp,
  encoding = "latin1",
  verbose = TRUE,
  delete = TRUE
)

cl_struc2str(
  "REUTERS",
  struc = 0L:(nrow(cpos_matrix) - 1L),
  s_attribute = "article_id",
  registry = registry_tmp
)

# Clean up the temporary copy of the corpus
unlink(registry_tmp, recursive = TRUE)
unlink(data_dir_tmp, recursive = TRUE)

# Read values and regions of the s-attribute "id" directly from the data files
data_dir <- system.file(
  package = "RcppCWB",
  "extdata",
  "cwb",
  "indexed_corpora",
  "reuters"
)
avs <- s_attribute_get_values(s_attribute = "id", data_dir = data_dir)
rng <- s_attribute_get_regions(s_attribute = "id", data_dir = data_dir)

# Merge two annotation tables; corpus positions are integers throughout
x <- data.frame(
  cpos_left = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
s_attribute_merge(x, y)

[Package cwbtools version 0.4.2 Index]
