| corpus_install {cwbtools} | R Documentation | 
Install and manage corpora.
Description
Utility functions to assist the installation and management of indexed CWB corpora.
Usage
corpus_install(
  pkg = NULL,
  repo = "https://PolMine.github.io/drat/",
  tarball = NULL,
  doi = NULL,
  checksum = NULL,
  lib = .libPaths()[1],
  registry_dir,
  corpus_dir,
  ask = interactive(),
  load = TRUE,
  verbose = TRUE,
  user = NULL,
  password = NULL,
  ...
)
corpus_packages()
corpus_rename(
  old,
  new,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)
corpus_remove(corpus, registry_dir, ask = interactive(), verbose = TRUE)
corpus_as_tarball(
  corpus,
  registry_dir,
  data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
  tarfile,
  verbose = TRUE
)
corpus_copy(
  corpus,
  registry_dir,
  data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
  registry_dir_new = fs::path(tempdir(), "cwb", "registry"),
  data_dir_new = fs::path(tempdir(), "cwb", "indexed_corpora", tolower(corpus)),
  remove = FALSE,
  verbose = interactive(),
  progress = TRUE
)
corpus_recode(
  corpus,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
  skip = character(),
  to = c("latin1", "UTF-8"),
  verbose = TRUE
)
corpus_testload(
  corpus,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)
corpus_get_version(corpus, registry_dir = Sys.getenv("CORPUS_REGISTRY"))
corpus_reload(corpus, registry_dir, verbose = TRUE)
Arguments
| pkg | Name of a package (length-one character vector). | 
| repo | URL of the repository. | 
| tarball | URL,  S3-URI or local filename of a tarball with a CWB indexed
corpus. If  | 
| doi | The DOI (Digital Object Identifier) of a corpus deposited at Zenodo (e.g. "10.5281/zenodo.3748858"). | 
| checksum | A length-one  | 
| lib | Directory for R packages, defaults to .libPaths()[1]. | 
| registry_dir | The corpus registry directory. If missing, the result of
 | 
| corpus_dir | The directory that contains the data directories of indexed
corpora. If missing, the value of  | 
| ask | A logical value, whether to ask the user before an existing corpus is deleted and installed anew. | 
| load | A logical value. | 
| verbose | Logical, whether to be verbose. | 
| user | A user name that can be specified to download a corpus from a password protected site. | 
| password | A password that can be specified to download a corpus from a password protected site. | 
| ... | Further parameters that will be passed into  | 
| old | Name of the (old) corpus. | 
| new | Name of the (new) corpus. | 
| corpus | The ID of a CWB indexed corpus (in upper case). | 
| data_dir | The data directory where the files of the CWB corpus live. | 
| tarfile | Filename of tarball. | 
| registry_dir_new | Target directory for (new) registry files. | 
| data_dir_new | Target directory for corpus files. | 
| remove | A logical value. | 
| progress | Logical, whether to show a progress bar. | 
| skip | A character vector with s_attributes to skip. | 
| to | Character string describing the target encoding of the corpus. | 
Details
A CWB corpus consists of a set of binary files with corpus data kept together in a data directory, and a registry file, which is a plain text file that details the corpus id, corpus properties, structural and positional attributes. The registry file also specifies the path to the corpus data directory. Typically, the registry directory and a corpus directory with the data directories for individual corpora are within one parent folder (which might be called "cwb" by default). See the following stylized directory structure.
  .
  |- registry/
  |  |- corpus1
  |  +- corpus2
  |
  +- indexed_corpora/
    |- corpus1/
    |  |- file1
    |  |- file2
    |  +- file3
    |
    +- corpus2/
       |- file1
       |- file2
       +- file3
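As a rough illustration of this layout, the following base R sketch creates an empty skeleton in a temporary directory. The folder name cwb_layout_sketch and the corpus names corpus1 and corpus2 are placeholders taken from the stylized tree; no real corpus data is written, and the registry files are left empty.

```r
# Sketch only: build an empty registry / indexed_corpora skeleton.
cwb_dir <- file.path(tempdir(), "cwb_layout_sketch")  # hypothetical parent folder
registry_dir <- file.path(cwb_dir, "registry")
corpus_dir <- file.path(cwb_dir, "indexed_corpora")

dir.create(registry_dir, recursive = TRUE, showWarnings = FALSE)
for (corpus in c("corpus1", "corpus2")) {
  # One data directory per corpus ...
  dir.create(file.path(corpus_dir, corpus), recursive = TRUE, showWarnings = FALSE)
  # ... and one (here empty) registry file named after the corpus.
  file.create(file.path(registry_dir, corpus))
}

# Relative paths of everything that was created.
layout <- list.files(cwb_dir, recursive = TRUE, include.dirs = TRUE)
print(layout)
```

A real registry file would additionally state the path to the data directory of its corpus.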
The corpus_install() function will assist the installation of a
corpus. The following scenarios are offered:
- If argument tarball is a local tarball, the tarball will be extracted and files will be moved.
- If tarball is a URL, the tarball will be downloaded from the online location. It is possible to state user credentials using the arguments user and password. Then the aforementioned installation (scenario 1) is executed. If argument pkg is the name of an installed package, corpus files will be moved into this package.
- If argument doi is a Digital Object Identifier (DOI), the URL from which a corpus tarball can be downloaded is derived from the information available at that location. The tarball is downloaded and the corpus installed. If argument pkg is defined, files will be moved into an R package; otherwise the system registry and corpus directories are used. Note that at this stage, it is assumed that the DOI has been awarded by Zenodo.
- If argument pkg is provided and tarball is NULL, corpora included in the package will be installed as system corpora, using the storage location specified by registry_dir.
If the corpus to be installed is already available, a dialogue will ask the
user whether an existing corpus shall be deleted and installed anew, if
argument ask is TRUE.
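The four scenarios might translate into calls along the following lines. This is a sketch under assumptions: the tarball path, the URL, the credentials and the package name MyCorpusPkg are hypothetical placeholders, and only the DOI is taken from the argument description above. The calls are stored unevaluated with quote(), so nothing is downloaded or installed when the snippet runs.

```r
# Sketch only: one unevaluated call per installation scenario.
# All file names, URLs, credentials and package names are placeholders.
calls <- list(
  local_tarball = quote(
    cwbtools::corpus_install(tarball = "~/corpora/mycorpus.tar.gz")
  ),
  remote_tarball = quote(
    cwbtools::corpus_install(
      tarball = "https://example.org/mycorpus.tar.gz",
      user = "jane", password = "secret"
    )
  ),
  from_doi = quote(
    cwbtools::corpus_install(doi = "10.5281/zenodo.3748858", ask = FALSE)
  ),
  from_package = quote(
    cwbtools::corpus_install(pkg = "MyCorpusPkg", tarball = NULL)
  )
)

# Show the calls without running them.
for (nm in names(calls)) {
  cat(nm, ":", paste(deparse(calls[[nm]]), collapse = " "), "\n")
}
```

To actually run one scenario, evaluate the stored call, for instance with eval(calls$from_doi), after replacing the placeholders with real values.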
corpus_packages() will detect the packages that include CWB
corpora. Note that the directory structure of all installed packages is
evaluated which may be slow on network-mounted file systems.
corpus_rename() will rename a corpus, affecting the name of the
registry file, the corpus id, and the name of the directory where data
files reside.
corpus_remove() can be used to delete a corpus.
corpus_as_tarball() will create a tarball (.tar.gz-file) with
two subdirectories. The 'registry' subdirectory will host the registry file
for the tarred corpus. The data files will be put in a subdirectory with
the corpus name in the 'indexed_corpora' subdirectory.
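To make the resulting tarball structure concrete, here is a minimal base R sketch that packs a dummy corpus into the same two-subdirectory layout. It is not how corpus_as_tarball() is implemented; the corpus name mycorpus, the file word.corpus and the file contents are made up for illustration.

```r
# Sketch only: pack a dummy corpus as registry/ + indexed_corpora/<corpus>/.
root <- file.path(tempdir(), "tarball_sketch")
dir.create(file.path(root, "registry"), recursive = TRUE, showWarnings = FALSE)
dir.create(
  file.path(root, "indexed_corpora", "mycorpus"),
  recursive = TRUE, showWarnings = FALSE
)

# Dummy stand-ins for the registry file and one binary data file.
writeLines("dummy registry file", file.path(root, "registry", "mycorpus"))
writeLines("dummy data", file.path(root, "indexed_corpora", "mycorpus", "word.corpus"))

tarfile <- file.path(tempdir(), "mycorpus.tar.gz")

# Tar relative to 'root' so that archive paths start with the two subdirectories.
old_wd <- setwd(root)
utils::tar(
  tarfile,
  files = c("registry", "indexed_corpora"),
  compression = "gzip",
  tar = "internal"
)
setwd(old_wd)

# List the archive contents.
contents <- utils::untar(tarfile, list = TRUE, tar = "internal")
print(contents)
```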
corpus_copy() will create a copy of a corpus (useful for
experimental modifications, for instance).
corpus_get_version() parses the registry file and derives the
corpus version number from the corpus properties. The return value is a
numeric_version class object. The corpus version is expected to follow
semantic versioning (three numeric components, e.g. '0.8.1'). If the corpus
version has another format or if it is not available, the return value is NA.
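The version handling described here can be sketched in base R as follows. The helper parse_corpus_version() is hypothetical and not part of cwbtools; it only mirrors the documented behaviour for a version string that has already been read from the registry file.

```r
# Sketch only: turn a version string into a numeric_version, or NA.
# parse_corpus_version() is a hypothetical helper, not a cwbtools function.
parse_corpus_version <- function(x) {
  # Accept exactly three dot-separated numeric components, e.g. "0.8.1".
  ok <- length(x) == 1L && !is.na(x) &&
    grepl("^[0-9]+\\.[0-9]+\\.[0-9]+$", x)
  if (ok) numeric_version(x) else NA
}

v <- parse_corpus_version("0.8.1")
class(v)                              # class of the parsed version
v >= "0.8.0"                          # versions compare component-wise
parse_corpus_version("v2")            # unexpected format
parse_corpus_version(NA_character_)   # version not available
```

Returning a numeric_version object rather than a plain string allows comparisons such as checking for a minimum corpus version.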
corpus_reload() will unload a corpus if necessary and reload it.
Useful to make new features of a corpus available after modification.
Returns logical value TRUE if successful, FALSE if not.
Value
Logical value TRUE if installation has been successful, or FALSE
if not.
See Also
For managing registry files, see registry_file_parse.
Examples
library(cwbtools)

# Example 1: copy the REUTERS corpus shipped with RcppCWB to a temporary location.
# Remove a registry file left over from a previous run, if any.
registry_file_new <- fs::path(tempdir(), "cwb", "registry", "reuters")
if (file.exists(registry_file_new)) file.remove(registry_file_new)
corpus_copy(
  corpus = "REUTERS",
  registry_dir = system.file(package = "RcppCWB", "extdata", "cwb", "registry"),
  data_dir = system.file(
    package = "RcppCWB",
    "extdata", "cwb", "indexed_corpora", "reuters"
  )
)
# Clean up the temporary copy.
unlink(fs::path(tempdir(), "cwb"), recursive = TRUE)
# Example 2: copy REUTERS, recode the copy to UTF-8 and query it.
corpus <- "REUTERS"
pkg <- "RcppCWB"
s_attr <- "places"
Q <- '"oil"'
# Source locations (packaged corpus) and temporary target locations.
registry_dir_src <- system.file(package = pkg, "extdata", "cwb", "registry")
data_dir_src <- system.file(package = pkg, "extdata", "cwb", "indexed_corpora", tolower(corpus))
registry_dir_tmp <- fs::path(tempdir(), "cwb", "registry")
registry_file_tmp <- fs::path(registry_dir_tmp, tolower(corpus))
data_dir_tmp <- fs::path(tempdir(), "cwb", "indexed_corpora", tolower(corpus))
# Start from a clean target: drop an old registry file, empty the data directory.
if (file.exists(registry_file_tmp)) file.remove(registry_file_tmp)
if (!dir.exists(data_dir_tmp)) {
  dir.create(data_dir_tmp, recursive = TRUE)
} else {
  if (length(list.files(data_dir_tmp)) > 0L)
    file.remove(list.files(data_dir_tmp, full.names = TRUE))
}
# Copy the corpus to the temporary location.
corpus_copy(
  corpus = corpus,
  registry_dir = registry_dir_src,
  data_dir = data_dir_src,
  registry_dir_new = registry_dir_tmp,
  data_dir_new = data_dir_tmp
)
# Encoding of the copy before recoding.
RcppCWB::cl_charset_name(corpus = corpus, registry = registry_dir_tmp)
corpus_recode(
  corpus = corpus,
  registry_dir = registry_dir_tmp,
  data_dir = data_dir_tmp,
  to = "UTF-8"
)
# Unload the corpus and re-initialize CQP so that the recoded version is used,
# then check the encoding again.
RcppCWB::cl_delete_corpus(corpus = corpus, registry = registry_dir_tmp)
RcppCWB::cqp_initialize(registry_dir_tmp)
RcppCWB::cl_charset_name(corpus = corpus, registry = registry_dir_tmp)
# Decode all values of the structural attribute in the recoded corpus.
n_strucs <- RcppCWB::cl_attribute_size(
  corpus = corpus, attribute = s_attr, attribute_type = "s", registry = registry_dir_tmp
)
strucs <- 0L:(n_strucs - 1L)  # strucs are zero-based
struc_values <- RcppCWB::cl_struc2str(
  corpus = corpus, s_attribute = s_attr, struc = strucs, registry = registry_dir_tmp
)
places <- unique(struc_values)
# Point CQP to the temporary registry and run the query.
Sys.setenv("CORPUS_REGISTRY" = registry_dir_tmp)
if (RcppCWB::cqp_is_initialized()) {
  RcppCWB::cqp_reset_registry()
} else {
  RcppCWB::cqp_initialize()
}
RcppCWB::cqp_query(corpus = corpus, query = Q)
cpos <- RcppCWB::cqp_dump_subcorpus(corpus = corpus)
# Map corpus positions of the matches to token ids, and ids to strings.
ids <- RcppCWB::cl_cpos2id(
  corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, cpos = cpos
)
tokens <- RcppCWB::cl_id2str(
  corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, id = ids
)
unique(tokens)
# Clean up the temporary copy.
unlink(fs::path(tempdir(), "cwb"), recursive = TRUE)