s_attribute_encode {cwbtools}    R Documentation

Read, process and write data on structural attributes.

Description

Read, process and write data on structural attributes.

Usage

s_attribute_encode(
  values,
  data_dir,
  s_attribute,
  corpus,
  region_matrix,
  method = c("R", "CWB"),
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  encoding,
  delete = FALSE,
  verbose = TRUE
)

s_attribute_recode(
  data_dir,
  s_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)

s_attribute_files(s_attribute, data_dir)

s_attribute_get_values(s_attribute, data_dir)

s_attribute_get_regions(s_attribute, data_dir)

s_attribute_merge(x, y)

s_attribute_delete(corpus, s_attribute)

s_attribute_rename(corpus, old, new, registry_dir, verbose = TRUE)

Arguments

values

A character vector with the values of the structural attribute, one value per region (i.e. per row of region_matrix).

data_dir

The data directory of the corpus, i.e. the directory where the files of the structural attribute are written (or read from).

s_attribute

Name of the structural attribute, an atomic character vector containing only lowercase ASCII letters (a-z), digits (0-9), hyphens (-) and underscores (_). Non-ASCII and uppercase letters are not allowed.

corpus

The ID of a CWB corpus, stated as a character string (e.g. "REUTERS").

region_matrix

A two-column matrix with corpus positions: the left boundary of each region in the first column, the right boundary in the second column.

method

Either 'R' or 'CWB'.

registry_dir

Path name of the registry directory.

encoding

Encoding of the data.

delete

Logical, whether to call RcppCWB::cl_delete_corpus() so that a newly encoded s-attribute becomes available for a corpus that has been loaded before.

verbose

Logical, whether to show progress messages.

from

Character string describing the current encoding of the attribute.

to

Character string describing the target encoding of the attribute.

x

Data defining a first s-attribute, a data.table (or an object coercible to a data.table) with three columns ("cpos_left", "cpos_right", "value").

y

Data defining a second s-attribute, a data.table (or an object coercible to a data.table) with three columns ("cpos_left", "cpos_right", "value").

old

A character vector with s-attributes to be renamed.

new

A character vector with the new names of the s-attributes. It needs to have the same length as the vector old. Names are matched by position: the attribute at the nth position of old gets the name at the nth position of new.
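Two conventions stated in the argument list above can be checked before calling any of the functions: the naming rule for s_attribute and the positional pairing of old and new. The following is a minimal base R sketch. The helper is_valid_s_attribute_name() is hypothetical and not part of cwbtools, and it assumes that a single name is expected.

```r
# Hypothetical helper (not part of cwbtools): check the naming rule
# stated for the s_attribute argument.
# ASSUMPTION: exactly one name is passed.
is_valid_s_attribute_name <- function(x) {
  is.character(x) && length(x) == 1L && !is.na(x) &&
    grepl("^[a-z0-9_-]+$", x, perl = TRUE)
}

ok_plain  <- is_valid_s_attribute_name("article_id")
ok_hyphen <- is_valid_s_attribute_name("text-id2")
bad_upper <- is_valid_s_attribute_name("Article_ID")
bad_ascii <- is_valid_s_attribute_name("gr\u00f6\u00dfe")

# Positional pairing of old and new names, as described for
# s_attribute_rename(): both vectors need to have the same length.
old <- c("id", "places")
new <- c("article_id", "locations")
stopifnot(length(old) == length(new))
rename_map <- setNames(new, old)
```

Here rename_map is a named character vector that makes the pairing explicit: the names are the old attribute names, the elements the new ones.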

Details

s_attribute_encode() provides a 'pure R' implementation to add or modify structural attributes of an existing CWB corpus.

If the corpus has been loaded/used before, a new s-attribute may not be available unless RcppCWB::cl_delete_corpus() has been called. Set the argument delete to TRUE to have this function called.

s_attribute_recode() recodes the values in the avs file and updates the attribute value index in the avx file. The rng file and the registry file remain unchanged. Because the encoding stated in the registry file is not updated, it is highly recommended to treat s_attribute_recode() as a helper for corpus_recode(), which recodes all s-attributes and all p-attributes and resets the encoding in the registry file.

s_attribute_files() will return a named character vector with the data files (extensions: "avs", "avx", "rng") in the directory indicated by data_dir for the structural attribute s_attribute.

s_attribute_get_values() is equivalent to calling the CL function cl_struc2str for all strucs of a structural attribute. It is a "pure R" operation that is faster than using CL, as it reads the data files of the s-attribute in full and directly. The return value is a character vector with all string values of the s-attribute.
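The idea of reading the value files in one pass can be sketched in base R. This is an illustration only, not the code of s_attribute_get_values(). It ASSUMES that the avs file holds NUL-terminated strings and that the avx file holds pairs of 4-byte big-endian integers (struc index, byte offset into the avs file). The function read_values_sketch() and the temporary files are hypothetical.

```r
# Hypothetical sketch, not the cwbtools implementation.
# ASSUMPTIONS: avs = concatenated NUL-terminated strings;
# avx = pairs of 4-byte big-endian integers (struc index, avs offset).
read_values_sketch <- function(avs_file, avx_file) {
  avs_raw <- readBin(avs_file, what = raw(), n = file.size(avs_file))
  avx <- readBin(
    avx_file, what = integer(),
    n = file.size(avx_file) / 4L, size = 4L, endian = "big"
  )
  offsets <- matrix(avx, ncol = 2L, byrow = TRUE)[, 2L]
  nul <- which(avs_raw == as.raw(0L))
  vapply(offsets, function(off) {
    start <- off + 1L                  # offsets are zero-based
    end <- nul[nul >= start][1L] - 1L  # byte before the next NUL
    if (end < start) return("")        # empty string value
    rawToChar(avs_raw[start:end])
  }, character(1L))
}

# Demonstration with made-up files: three strucs, two distinct values.
avs_tmp <- tempfile(fileext = ".avs")
avx_tmp <- tempfile(fileext = ".avx")
writeBin(
  c(charToRaw("a1"), as.raw(0L), charToRaw("b22"), as.raw(0L)),
  con = avs_tmp
)
writeBin(c(0L, 0L, 1L, 3L, 2L, 0L), con = avx_tmp, size = 4L, endian = "big")
values_read <- read_values_sketch(avs_tmp, avx_tmp)
unlink(c(avs_tmp, avx_tmp))
```

In this made-up data the first and the third struc point to the same offset, so they share one stored string.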

s_attribute_get_regions() returns a two-column integer matrix with the regions for the strucs of a given s-attribute. Left corpus positions are in the first column, right corpus positions in the second column. The result is equivalent to calling RcppCWB::get_region_matrix() for all strucs of an s-attribute, but may be somewhat faster: it is a "pure R" function that reads the data file in full and directly.
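Reading a whole file of regions at once can be sketched with base R as follows. This is not the code of s_attribute_get_regions(). It ASSUMES that the rng file is a flat sequence of 4-byte big-endian integers that alternate between left and right corpus positions. The function read_rng_sketch() and the temporary file are hypothetical.

```r
# Hypothetical sketch, not the cwbtools implementation.
# ASSUMPTION: the file is a flat sequence of 4-byte big-endian integers,
# alternating left and right corpus positions.
read_rng_sketch <- function(rng_file) {
  n_int <- file.size(rng_file) / 4L
  cpos <- readBin(
    rng_file, what = integer(), n = n_int, size = 4L, endian = "big"
  )
  matrix(cpos, ncol = 2L, byrow = TRUE)
}

# Demonstration with a made-up file (not a real corpus).
regions <- matrix(c(0L, 9L, 10L, 24L, 25L, 30L), ncol = 2L, byrow = TRUE)
rng_tmp <- tempfile(fileext = ".rng")
writeBin(as.integer(t(regions)), con = rng_tmp, size = 4L, endian = "big")
regions_read <- read_rng_sketch(rng_tmp)
unlink(rng_tmp)
```

The matrix read back has one row per region, with left corpus positions in the first column and right corpus positions in the second.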

s_attribute_merge() combines two tables with regions for s-attributes, checking for intersections that may cause problems. The heuristic is to keep all non-intersecting annotations and those annotations that define the same region in object x and object y. Annotations of x and y that overlap uncleanly, i.e. without identical left and right corpus positions ("cpos_left" / "cpos_right"), are dropped. The scenario for using the function is to decode an s-attribute (using s_attribute_decode()), mix in an additional annotation, and re-encode the enhanced s-attribute (using s_attribute_encode()).
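The heuristic can be illustrated with a small base R sketch that works on plain data frames. It is not the implementation of s_attribute_merge(), and its output is not guaranteed to match the function in every detail. The helper merge_regions_sketch() is hypothetical, and taking the value from x when both tables define the same region is an assumption.

```r
# Hypothetical sketch of the heuristic, not the cwbtools implementation.
merge_regions_sketch <- function(x, y) {
  x <- as.data.frame(x, stringsAsFactors = FALSE)
  y <- as.data.frame(y, stringsAsFactors = FALSE)
  # TRUE for rows of `a` that intersect a region of `b` without
  # sharing both its left and its right corpus position.
  unclean <- function(a, b) {
    vapply(seq_len(nrow(a)), function(i) {
      intersects <- a$cpos_left[i] <= b$cpos_right &
        a$cpos_right[i] >= b$cpos_left
      same_region <- a$cpos_left[i] == b$cpos_left &
        a$cpos_right[i] == b$cpos_right
      any(intersects & !same_region)
    }, logical(1L))
  }
  x_keep <- x[!unclean(x, y), , drop = FALSE]
  y_keep <- y[!unclean(y, x), , drop = FALSE]
  # ASSUMPTION: for regions present in both tables, the row of x is kept.
  dup <- paste(y_keep$cpos_left, y_keep$cpos_right) %in%
    paste(x_keep$cpos_left, x_keep$cpos_right)
  merged <- rbind(x_keep, y_keep[!dup, , drop = FALSE])
  merged <- merged[order(merged$cpos_left), , drop = FALSE]
  rownames(merged) <- NULL
  merged
}

x <- data.frame(
  cpos_left  = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left  = c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
merged <- merge_regions_sketch(x, y)
```

In this sketch, the regions 10-12 / 11-12 and 20-21 / 20-22 overlap uncleanly and are dropped. The regions 1-2 and 30-33 do not intersect with anything and are kept, as are the regions 5-5 and 25-27, which are identical in both tables.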

Function s_attribute_delete() is not yet implemented.

Function s_attribute_rename() can be used to rename a structural attribute.

See Also

To decode a structural attribute, see s_attribute_decode.

Examples

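library(RcppCWB)
library(cwbtools) # provides corpus_copy() and the s_attribute_*() functions

# Work on a temporary copy of the REUTERS corpus shipped with RcppCWB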
registry_tmp <- fs::path(tempdir(), "cwb", "registry")
data_dir_tmp <- fs::path(tempdir(), "cwb", "indexed_corpora", "reuters")

cwb_dir_rcppcwb <- system.file(package = "RcppCWB", "extdata", "cwb")
registry_dir_rcppcwb <- fs::path(cwb_dir_rcppcwb, "registry")
data_dir_rcppcwb <- fs::path(cwb_dir_rcppcwb, "indexed_corpora", "reuters")

corpus_copy(
  corpus = "REUTERS",
  registry_dir = registry_dir_rcppcwb,
  data_dir = data_dir_rcppcwb,
  registry_dir_new = registry_tmp,
  data_dir_new = data_dir_tmp
)

no_strucs <- cl_attribute_size(
  corpus = "REUTERS",
  attribute = "id",
  attribute_type = "s",
  registry = registry_tmp
)

cpos_matrix <- get_region_matrix(
  corpus = "REUTERS",
  struc = 0L:(no_strucs - 1L),
  s_attribute = "id",
  registry = registry_tmp
)

s_attribute_encode(
  values = as.character(seq_len(nrow(cpos_matrix))), # character vector, as documented
  data_dir = data_dir_tmp,
  s_attribute = "article_id",
  corpus = "REUTERS",
  region_matrix = cpos_matrix,
  method = "R",
  registry_dir = registry_tmp,
  encoding = "latin1",
  verbose = TRUE,
  delete = TRUE
)

cl_struc2str(
  "REUTERS",
  struc = 0L:(nrow(cpos_matrix) - 1L),
  s_attribute = "article_id",
  registry = registry_tmp
)

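# Clean up the temporary copy of the corpus
unlink(registry_tmp, recursive = TRUE)
unlink(data_dir_tmp, recursive = TRUE)

# Read values and regions of an s-attribute directly from the data files
data_dir <- system.file(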
  package = "RcppCWB",
  "extdata",
  "cwb",
  "indexed_corpora",
  "reuters"
)
avs <- s_attribute_get_values(s_attribute = "id", data_dir = data_dir)
rng <- s_attribute_get_regions(s_attribute = "id", data_dir = data_dir)

# Merge two tables with region annotations
x <- data.frame(
  cpos_left =  c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left =  c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
s_attribute_merge(x, y)

[Package cwbtools version 0.4.0 Index]