regions {polmineR} | R Documentation |
Regions of a CWB corpus.
Description
Class to store and process the regions of a corpus. Regions are defined by start and end corpus positions and correspond to a set of tokens surrounded by start and end XML tags.
Usage
regions(x, s_attribute)
## S4 method for signature 'corpus'
regions(x, s_attribute)
## S4 method for signature 'subcorpus'
regions(x, s_attribute)
as.regions(x, ...)
## S3 method for class 'regions'
as.data.table(x, keep.rownames, values = NULL, ...)
Arguments
x |
object of class |
s_attribute |
An s-attribute denoted by a length-one |
... |
Further arguments. |
keep.rownames |
Required argument to safeguard consistency with S3 method definition in the |
values |
values to assign to a column that will be added |
Details
The regions class is a minimal representation of regions. It does not include information on the "strucs" (region IDs) that are used internally to obtain values of s-attributes, nor on which combination of conditions on s-attributes has been used to obtain the regions. This is left to the subcorpus class. Whereas the subcorpus class is associated with the assumption that a set of regions is a meaningful sub-unit of a corpus, the regions class focuses on the individual sequences of tokens defined by a structural attribute (such as paragraphs, sentences, named entities).
Information on regions is maintained in the cpos slot of the regions S4 class: a two-column matrix with begin and end corpus positions (first and second column, respectively). All other slots are inherited from the corpus class.
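To make the layout of the cpos matrix concrete, the following base R sketch builds a matrix of the same shape by hand and derives region sizes from it. The corpus positions are invented for illustration and do not come from any real corpus; the object names (cpos, region_sizes, covered) are assumptions of this sketch, and no polmineR function is called.

```r
# Hypothetical begin/end corpus positions for three regions (illustration only).
# Each row is one region: column 1 = begin, column 2 = end (both inclusive).
cpos <- matrix(
  c( 0L,  4L,
     5L, 11L,
    12L, 12L),
  ncol = 2, byrow = TRUE
)

# Number of tokens per region: end - begin + 1, because both ends are inclusive.
region_sizes <- cpos[, 2] - cpos[, 1] + 1L

# All corpus positions covered by the regions, in order.
covered <- unlist(
  lapply(seq_len(nrow(cpos)), function(i) cpos[i, 1]:cpos[i, 2])
)
```

A region consisting of a single token, as in the third row, has identical begin and end positions.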
The understanding of "regions" is modelled on the usage of terms by CWB developers. As it is put in the CQP Interface and Query Language Manual: "Matching pairs of XML start and end tags are encoded as token regions, identified by the corpus positions of the first token (immediately following the start tag) and the last token (immediately preceding the end tag) of the region." (p. 6)
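The mapping from tag pairs to token regions described in the quotation can be sketched in base R. This is only an illustration under stated assumptions: the token stream is made up, corpus positions are taken to be zero-based, tags are assumed not to consume corpus positions, and every region is assumed to contain at least one token. It is not how CWB or polmineR encode corpora internally.

```r
# Hypothetical token stream with XML-style sentence tags (illustration only).
stream <- c("<s>", "Hello", "world", "</s>", "<s>", "Bye", "</s>")

is_open  <- stream == "<s>"
is_close <- stream == "</s>"
is_token <- !(is_open | is_close)

# Zero-based corpus position of each token; tags get NA as they are not tokens.
cpos_of <- rep(NA_integer_, length(stream))
cpos_of[is_token] <- seq_len(sum(is_token)) - 1L

# Region begin: first token immediately following each start tag.
region_begin <- cpos_of[which(is_open) + 1L]
# Region end: last token immediately preceding each end tag.
region_end <- cpos_of[which(is_close) - 1L]

# Two-column matrix, one row per region, analogous in shape to the cpos slot.
tag_regions <- cbind(region_begin, region_end)
```

Here the first sentence spans corpus positions 0 to 1 and the second consists of the single token at position 2.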
The as.regions method coerces objects to a regions object.
The as.data.table method returns the matrix with corpus positions in the slot cpos as a data.table.
Slots
cpos
A two-column matrix with start and end corpus positions (first and second column, respectively).
See Also
Other classes to manage corpora:
corpus-class, phrases-class, ranges-class, subcorpus
Examples
use("polmineR")
P <- partition("GERMAPARLMINI", date = "2009-11-12", speaker = "Jens Spahn")
R <- as.regions(P)
use(pkg = "RcppCWB", corpus = "REUTERS")
# Get regions matrix as data.table, without / with values
sc <- corpus("REUTERS") %>% subset(grep("saudi-arabia", places))
regions_dt <- as.data.table(sc)
regions_dt <- as.data.table(
  sc,
  values = s_attributes(sc, "id", unique = FALSE)
)