| CorpusData {cwbtools} | R Documentation |
Manage Corpus Data and Encode CWB Corpus.
Description
Manage Corpus Data and Encode CWB Corpus.
Manage Corpus Data and Encode CWB Corpus.
Details
See the CWB Encoding Tutorial on characters allowed for encoding attributes: "By convention, all attribute names must be lowercase (more precisely, they may only contain the characters a-z, 0-9, -, and _, and may not start with a digit). Therefore, the names of XML elements to be included in the CWB corpus must not contain any non-ASCII or uppercase letters." (section 2)
Import XML files.
Public fields
chunktableA
data.tablewith column "id" (unique values), columns with metadata, and a column with text chunks.tokenstreamA
data.tablewith a column "cpos" (corpus position), and columns with positional attributes, such as "word", "lemma", "pos", "stem".metadataA
data.tablewith a column "id", to link data with chunks/tokenstream, columns with document-level metadata, and a column "cpos_left" and "cpos_right", which can be generated using method$add_corpus_positions().sentencesA
data.table.named_entitiesA
data.table.
Methods
Public methods
Method new()
Initialize a new instance of class CorpusData.
Usage
CorpusData$new()
Returns
A class CorpusData object.
Method print()
Print summary of CorpusData object.
Usage
CorpusData$print()
Method tokenize()
Simple tokenization of text in chunktable.
Usage
CorpusData$tokenize(..., verbose = TRUE, progress = TRUE)
Arguments
...Arguments that are passed into
tokenizers::tokenize_words().verboseA logical value, whether to be verbose.
progressA logical value, whether to show progress bar.
Method import_xml()
Usage
CorpusData$import_xml( filenames, body = "//body", meta = NULL, mc = NULL, progress = TRUE )
Arguments
filenamesA vector of files to process.
bodyAn xpath expression defining the body of the XML document.
metaA named character vector with XPath expressions.
mcA numeric/integer value, number of cores to use.
progressA logical value, whether to show progress bar.
Returns
The CorpusData object is returned invisibly.
Method add_corpus_positions()
Add column 'cpos' to tokenstream and columns 'cpos_left' and 'cpos_right' to metadata.
Usage
CorpusData$add_corpus_positions(verbose = TRUE)
Arguments
verboseA logical value, whether to be verbose.
Method purge()
Remove patterns from chunkdata that are known to cause problems. This is done most efficiently at the chunkdata level of data preparation as the length of the character vector to handle is much smaller than when tokenization/annotation has been performed.
Usage
CorpusData$purge(
replacements = list(c("^\\s*<.*?>\\s*$", ""), c("’", "'"))
)Arguments
replacementsA list of length-two character vectors with regular expressions and replacements.
Method encode()
Encode corpus. If the corpus already exists, it will be removed.
Usage
CorpusData$encode(
corpus,
p_attributes = "word",
s_attributes = NULL,
encoding,
registry_dir = Sys.getenv("CORPUS_REGISTRY"),
data_dir = NULL,
method = c("R", "CWB"),
verbose = TRUE,
compress = FALSE,
reload = TRUE,
quietly = TRUE
)Arguments
corpusThe name of the CWB corpus.
p_attributesPositional attributes.
s_attributesColumns that will be encoded as structural attributes.
encodingEncoding/charset of the CWB corpus.
registry_dirCorpus registry, the directory where registry files are stored.
data_dirDirectory where to create directory for indexed corpus files.
methodEither "R" or "CWB".
verboseA logical value, whether to be verbose.
compressA logical value, whether to compress corpus.
reloadA
logicalvalue, whether to reload corpus.quietlyA
logicalvalue passed intoRcppCWB::cwb_makeall(),RcppCWB::cwb_huffcode()andRcppCWB::cwb_compress_rdxto control verbosity of these functions.
Method clone()
The objects of this class are cloneable with this method.
Usage
CorpusData$clone(deep = FALSE)
Arguments
deepWhether to make a deep clone.
Examples
library(RcppCWB)
library(data.table)
# this example relies on the R method to write data to disk, there is also a method "CWB"
# that relies on CWB tools to generate the indexed corpus. The CWB can downloaded
# and installed within the package by calling cwb_install()
# create temporary registry file so that data in RcppCWB package can be used
registry_rcppcwb <- system.file(package = "RcppCWB", "extdata", "cwb", "registry")
registry_tmp <- fs::path(tempdir(), "registry")
if (!dir.exists(registry_tmp)) dir.create(registry_tmp)
r <- registry_file_parse("REUTERS", registry_dir = registry_rcppcwb)
r[["home"]] <- system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters")
registry_file_write(r, corpus = "REUTERS", registry_dir = registry_tmp)
# decode structural attribute 'places'
s_attrs_places <- RcppCWB::s_attribute_decode(
corpus = "REUTERS",
data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"),
s_attribute = "places", method = "R"
)
s_attrs_places[["id"]] <- 1L:nrow(s_attrs_places)
setnames(s_attrs_places, old = "value", new = "places")
# decode positional attribute 'word'
tokens <- apply(s_attrs_places, 1, function(row){
ids <- cl_cpos2id(
corpus = "REUTERS", cpos = row[1]:row[2],
p_attribute = "word", registry = registry_tmp
)
cl_id2str(corpus = "REUTERS", id = ids, p_attribute = "word", registry = registry_tmp)
})
tokenstream <- rbindlist(
lapply(
1L:length(tokens),
function(i) data.table(id = i, word = tokens[[i]]))
)
tokenstream[["cpos"]] <- 0L:(nrow(tokenstream) - 1L)
# create CorpusData object (see vignette for further explanation)
CD <- CorpusData$new()
CD$tokenstream <- as.data.table(tokenstream)
CD$metadata <- as.data.table(s_attrs_places)
# Remove temporary registry with home dir still pointing to RcppCWB data dir
# to prevent data from being deleted
file.remove(fs::path(registry_tmp, "reuters"))
file.remove(registry_tmp)
# create temporary directories (registry directory and one for indexed corpora)
registry_tmp <- fs::path(tempdir(), "registry")
data_dir_tmp <- fs::path(tempdir(), "data_dir")
if (!dir.exists(registry_tmp)) dir.create(registry_tmp)
if (!dir.exists(data_dir_tmp)) dir.create(data_dir_tmp)
CD$encode(
corpus = "REUTERS", encoding = "utf8",
p_attributes = "word", s_attributes = "places",
registry_dir = registry_tmp, data_dir = data_dir_tmp,
method = "R"
)
reg <- registry_data(name = "REUTERS", id = "REUTERS", home = data_dir_tmp, p_attributes = "word")
registry_file_write(data = reg, corpus = "REUTERS", registry_dir = registry_tmp)
# see whether it works
cl_cpos2id(corpus = "REUTERS", p_attribute = "word", cpos = 0L:4049L, registry = registry_tmp)