p_attribute_encode {cwbtools} | R Documentation |
Encode Positional Attribute(s).
Description
Generate positional attribute from a character vector of tokens (the token stream).
Usage
p_attribute_encode(
  token_stream,
  p_attribute = "word",
  registry_dir,
  corpus,
  data_dir,
  method = c("R", "CWB"),
  verbose = TRUE,
  quietly = FALSE,
  encoding = get_encoding(token_stream),
  compress = FALSE,
  reload = TRUE
)

p_attribute_recode(
  data_dir,
  p_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)

p_attribute_rename(
  corpus,
  old,
  new,
  registry_dir,
  verbose = TRUE,
  dryrun = FALSE
)
Arguments
token_stream |
A |
p_attribute |
The positional attribute to create - a |
registry_dir |
Registry directory. |
corpus |
ID of the CWB corpus to create. |
data_dir |
The data directory for the binary files of the corpus. |
method |
Either 'CWB' or 'R', defaults to 'R'. See section 'Details'. |
verbose |
A |
quietly |
A |
encoding |
Encoding as defined in the charset corpus property of the registry file for the corpus ('latin1' to 'latin9', and 'utf8'). |
compress |
A |
reload |
A |
from |
Character string describing the current encoding of the attribute. |
to |
Character string describing the target encoding of the attribute. |
old |
A |
new |
A |
dryrun |
A |
Details
Four steps generate the binary CWB corpus data format for positional attributes: (1) Encode the token stream of the corpus, (2) create index files, (3) compress token stream and (4) compress index files. Whereas steps 1 and 2 are required to make a corpus work, steps 3 and 4 are optional yet useful to reduce disk usage and improve performance. See the CQP Corpus Encoding Tutorial (sections 2-4) for an explanation of the procedure.
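As a rough illustration of step 1, the following base R sketch maps a token stream to integer ids via a lexicon of unique tokens. It is not the code used by p_attribute_encode(). The toy tokens, the variable names, and the on-disk layout (zero-based ids written as 4-byte big-endian integers) are assumptions made for this example.

```r
# Toy token stream (hypothetical data, for illustration only).
tokens <- c("the", "oil", "price", "the", "oil")

# The lexicon holds each distinct token once, in order of first occurrence.
lexicon <- unique(tokens)

# Encode: replace every token by its zero-based position in the lexicon.
ids <- match(tokens, lexicon) - 1L

# Write the ids as 4-byte big-endian integers (assumed layout, see lead-in).
corpus_file <- tempfile(fileext = ".corpus")
writeBin(ids, corpus_file, size = 4L, endian = "big")

# Decode again to check that the round trip restores the token stream.
ids_read <- readBin(
  corpus_file, what = integer(), n = length(ids),
  size = 4L, endian = "big"
)
decoded <- lexicon[ids_read + 1L]
```

Storing integer ids rather than strings is what makes the later indexing and compression steps possible: the lexicon is kept once, and the token stream becomes a fixed-width integer sequence.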
p_attribute_encode() offers an R and a CWB implementation, controlled by the argument method. With method 'R', the token stream is encoded in 'pure R'; the C implementation of CWB functionality exposed to R via the RcppCWB package is then used for the remaining steps (RcppCWB::cwb_makeall() for indexing, RcppCWB::cwb_huffcode() and RcppCWB::cwb_compress_rdx() for compression).
With method 'CWB', the token stream is written to disk, and the CWB command line utilities 'cwb-encode', 'cwb-makeall', 'cwb-huffcode' and 'cwb-compress-rdx' are called using system2(). The 'CWB' method requires an installation of the CWB. The cwb_install() function will download and install the CWB command line tools within the package. The 'CWB' method is still supported because it is used in the test suite of the package. The 'R' method is robust and is recommended.
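The following base R sketch shows one way to pick a value for the method argument depending on whether the CWB utilities can be found. It assumes the tools are on the system PATH, which may not hold when they were installed via cwb_install(); the variable names are hypothetical.

```r
# Names of the CWB command line utilities mentioned in the text above.
cwb_tools <- c("cwb-encode", "cwb-makeall", "cwb-huffcode", "cwb-compress-rdx")

# Sys.which() returns "" for every executable it cannot find on the PATH.
tool_paths <- Sys.which(cwb_tools)
available <- nzchar(tool_paths)

# Fall back to the recommended 'R' method unless all four tools are present.
encode_method <- if (all(available)) "CWB" else "R"
```

The resulting string could then be passed as method = encode_method in a call to p_attribute_encode().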
p_attribute_recode() will recode the values in the avs-file and change the attribute value index in the avx file. The rng-file remains unchanged. The registry file remains unchanged, and it is highly recommended to consider s_attribute_recode() as a helper for corpus_recode() that will recode all s-attributes, all p-attributes, and will reset the encoding in the registry file.
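To illustrate what recoding between the two supported encodings means for the stored strings, here is a minimal base R sketch using iconv(). It operates on an in-memory character vector and does not touch any corpus files; the sample tokens and variable names are assumptions for this example.

```r
# Sample tokens with non-ASCII characters, written with Unicode escapes
# so that the example does not depend on the encoding of the source file.
tokens_utf8 <- c("caf\u00e9", "Stra\u00dfe")

# Recode from UTF-8 to latin1, as in from = "UTF-8", to = "latin1".
tokens_latin1 <- iconv(tokens_utf8, from = "UTF-8", to = "latin1")

# The characters are the same, but the byte representation differs:
# each non-ASCII character takes two bytes in UTF-8 and one in latin1.
bytes_utf8 <- nchar(tokens_utf8, type = "bytes")
bytes_latin1 <- nchar(tokens_latin1, type = "bytes")

# Recoding back restores the original strings.
tokens_back <- iconv(tokens_latin1, from = "latin1", to = "UTF-8")
```

Because the byte length of a string changes with its encoding, any index of byte offsets into a file of recoded values has to be recomputed as well.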
Function p_attribute_rename() can be used to rename a positional attribute. Note that the corpus is not refreshed (unloaded, re-loaded), so it may be necessary to restart R for changes to become effective.
Value
TRUE is returned invisibly if encoding has been successful. FALSE indicates that an error has occurred.
Author(s)
Christoph Leonhardt, Andreas Blaette
Examples
# In this example, we follow a "pure R" approach.
library(dplyr)
reu <- system.file(package = "RcppCWB", "extdata", "examples", "reuters.txt")
tokens <- readLines(reu)
# Create new (and empty) directory structure
registry_tmp <- fs::path(tempdir(), "registry")
data_dir_tmp <- fs::path(tempdir(), "data_dir", "reuters")
if (dir.exists(registry_tmp)) unlink(registry_tmp, recursive = TRUE)
if (dir.exists(data_dir_tmp)) unlink(data_dir_tmp, recursive = TRUE)
dir.create(registry_tmp)
dir.create(data_dir_tmp, recursive = TRUE)
# Encode token stream (without compression)
p_attribute_encode(
  corpus = "reuters",
  token_stream = tokens,
  p_attribute = "word",
  data_dir = data_dir_tmp,
  registry_dir = registry_tmp,
  method = "R",
  compress = FALSE,
  quietly = TRUE,
  encoding = "utf8"
)
# Augment registry file
registry_file_parse(corpus = "REUTERS", registry_dir = registry_tmp) %>%
  registry_set_name("Reuters Sample Corpus") %>%
  registry_set_property("charset", "utf8") %>%
  registry_set_property("language", "en") %>%
  registry_set_property("build_date", as.character(Sys.Date())) %>%
  registry_file_write()
# Run query as a test
library(RcppCWB)
cqp_query(corpus = "REUTERS", query = '[]{3} "oil" []{3};')
regions <- cqp_dump_subcorpus(corpus = "REUTERS")
# For each match region, look up the token ids and turn them back into words
kwic <- apply(
  regions, 1,
  function(region){
    ids <- cl_cpos2id(
      "REUTERS",
      p_attribute = "word",
      registry = registry_tmp,
      cpos = region[1]:region[2]
    )
    words <- cl_id2str(
      corpus = "REUTERS",
      p_attribute = "word",
      registry = registry_tmp,
      id = ids
    )
    paste0(words, collapse = " ")
  }
)
kwic[1:10]