R: Subsetting corpora and subcorpora

subset-method {polmineR}

R Documentation

Subsetting corpora and subcorpora

Description

The structural attributes of a corpus (s-attributes) can be used to generate subcorpora (i.e. subcorpus class objects) by applying the subset-method. The subset-method can be applied to a corpus object, to a length-one character vector naming a corpus (as a shortcut), and to a subcorpus object.

Usage

## S4 method for signature 'corpus'
subset(x, subset, regex = FALSE, verbose = FALSE, ...)

## S4 method for signature 'character'
subset(x, ...)

## S4 method for signature 'subcorpus'
subset(x, subset, verbose = FALSE, ...)

## S4 method for signature 'remote_corpus'
subset(x, subset)

## S4 method for signature 'subcorpus_bundle'
subset(x, ..., iterate = FALSE, verbose = TRUE, progress = FALSE, mc = NULL)

Arguments

`x`	A `corpus` or `subcorpus` object. A corpus may also be specified by a length-one `character` vector.
`subset`	A `logical` expression indicating elements or rows to keep. The expression may be unevaluated (using `quote()` or `bquote()`).
`regex`	A `logical` value. If `TRUE`, values for s-attributes defined using the three dots (...) are interpreted as regular expressions and passed into a `grep` call for subsetting a table with the regions and values of structural attributes. If `FALSE` (the default), values for s-attributes must match exactly.
`verbose`	A `logical` value, whether to show progress messages.
`...`	An expression that will be used to create a subcorpus from s-attributes.
`iterate`	A `logical` value. If `TRUE`, process every single object of `x` individually.
`progress`	A `logical` value, whether to display a progress bar.
`mc`	An `integer` value, number of cores to use. If `NULL` (default), no multithreading.
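
The difference between exact and regular-expression matching described for `regex` can be sketched in base R. This is only an illustration on a toy table, not polmineR's internal code: the `regions` data frame, its column names and the helper `match_rows()` are assumptions made for this sketch.

```r
# Toy stand-in for a table of regions and s-attribute values
# (column names are assumptions made for this sketch).
regions <- data.frame(
  cpos_left  = c(0L, 120L, 310L, 555L),
  cpos_right = c(119L, 309L, 554L, 700L),
  speaker    = c("Angela Dorothea Merkel", "Volker Kauder",
                 "Angela Dorothea Merkel", "Ronald Pofalla"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the indices of the values in `x` that match
# any element of `values`, either exactly or as regular expressions.
match_rows <- function(x, values, regex = FALSE) {
  if (regex) {
    hits <- lapply(values, function(v) grep(v, x))
    sort(unique(unlist(hits)))
  } else {
    which(x %in% values)
  }
}

# Exact matching: "Merkel" is not a full speaker name, so nothing is kept.
exact_partial <- regions[match_rows(regions$speaker, "Merkel"), ]

# Regex matching: every speaker whose name contains "Merkel" is kept.
regex_partial <- regions[match_rows(regions$speaker, "Merkel", regex = TRUE), ]

# Several patterns: a row is kept if it matches at least one of them.
regex_multi <- regions[
  match_rows(regions$speaker, c("Merkel", "Kauder"), regex = TRUE),
]
```

In this toy setting, a partial name only selects regions when `regex = TRUE`, which is why the full name is needed for an exact match.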

Details

The default approach for subsetting a subcorpus_bundle is to temporarily merge its objects into a single subcorpus, perform subset(), and restore the subcorpus_bundle by splitting on the s-attribute of the input subcorpus_bundle. This approach may have unintended results if x has been generated using complex criteria, for instance if x resulted from as.speeches(). In this scenario, set argument iterate to TRUE to process the objects in the bundle one by one.
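
Why merging and re-splitting can change the shape of a bundle may be easier to see on a toy analogue in base R. The sketch below does not use polmineR classes: the list `bundle`, its element names and the columns are assumptions, with each data frame standing in for one object that was defined by two criteria (speaker and date).

```r
# Toy bundle: one data frame per object, each defined by speaker AND date.
bundle <- list(
  "Merkel_2009-10-28" = data.frame(
    speaker = "Merkel", date = "2009-10-28",
    interjection = c("speech", "interjection")
  ),
  "Merkel_2009-11-10" = data.frame(
    speaker = "Merkel", date = "2009-11-10",
    interjection = c("speech", "speech")
  ),
  "Kauder_2009-10-28" = data.frame(
    speaker = "Kauder", date = "2009-10-28",
    interjection = c("interjection", "speech")
  )
)

# Strategy 1 (analogue of the default): merge, subset once,
# then split again on a single attribute.
merged <- do.call(rbind, bundle)
kept <- merged[merged$interjection == "speech", ]
by_merge <- split(kept, kept$speaker)

# Strategy 2 (analogue of iterate = TRUE): subset each object on its own.
by_iterate <- lapply(
  bundle,
  function(obj) obj[obj$interjection == "speech", ]
)

length(by_merge)    # two groups: both Merkel objects were fused by speaker
length(by_iterate)  # three objects: the original grouping is preserved
```

Splitting on one attribute cannot recover a grouping that was built from two, which is the kind of unintended result the iterating strategy avoids, at the cost of one subsetting operation per object.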

Value

A subcorpus object. If the expression provided by argument subset includes undefined s-attributes, a warning is issued and the return value is NULL.
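
Because a misspelled s-attribute yields a warning and NULL rather than an error, code that uses the result may want to check for NULL explicitly. The following base R sketch shows one possible pattern. `checked_subset()` is a hypothetical helper, not part of polmineR, and `fake_subset()` merely imitates the documented behaviour so that the sketch runs without a corpus.

```r
# Hypothetical helper: evaluate a subsetting call, collect any warnings,
# and stop with an informative error if the result is NULL.
checked_subset <- function(expr) {
  warnings_seen <- character()
  result <- withCallingHandlers(
    expr,  # the promise is forced here, inside the handler's scope
    warning = function(w) {
      warnings_seen <<- c(warnings_seen, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  if (is.null(result)) {
    stop(
      "subsetting returned NULL: ",
      paste(warnings_seen, collapse = "; "),
      call. = FALSE
    )
  }
  result
}

# Stand-in that imitates the documented behaviour (warning, then NULL).
# With polmineR loaded, a real call such as subset(a, foo == "bar")
# would take its place.
fake_subset <- function() {
  warning("s-attribute 'foo' is not defined")
  NULL
}

outcome <- tryCatch(
  checked_subset(fake_subset()),
  error = function(e) conditionMessage(e)
)
```

Turning the NULL into an error keeps a pipeline from silently continuing with a missing subcorpus.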

Examples

use("polmineR")

# examples for standard and non-standard evaluation
a <- corpus("GERMAPARLMINI")

# subsetting a corpus object using non-standard evaluation
sc <- subset(a, speaker == "Angela Dorothea Merkel")
sc <- subset(a, speaker == "Angela Dorothea Merkel" & date == "2009-10-28")
sc <- subset(a, grepl("Merkel", speaker))
sc <- subset(a, grepl("Merkel", speaker) & date == "2009-10-28")

# subsetting corpus specified by character vector
sc <- subset("GERMAPARLMINI", grepl("Merkel", speaker))
sc <- subset("GERMAPARLMINI", speaker == "Angela Dorothea Merkel")
sc <- subset("GERMAPARLMINI", speaker == "Angela Dorothea Merkel" & date == "2009-10-28")
sc <- subset("GERMAPARLMINI", grepl("Merkel", speaker) & date == "2009-10-28")

# subsetting a corpus using the (old) logic of the partition-method
sc <- subset(a, speaker = "Angela Dorothea Merkel")
sc <- subset(a, speaker = "Angela Dorothea Merkel", date = "2009-10-28")
sc <- subset(a, speaker = "Merkel", regex = TRUE)
sc <- subset(a, speaker = c("Merkel", "Kauder"), regex = TRUE)
sc <- subset(a, speaker = "Merkel", date = "2009-10-28", regex = TRUE)

# providing the value for s-attribute as a variable
who <- "Volker Kauder"
sc <- subset(a, quote(speaker == !!who))

# quoting and quosures necessary when programming against subset
# note how variable who needs to be handled
gparl <- corpus("GERMAPARLMINI")
subcorpora <- lapply(
  c("Angela Dorothea Merkel", "Volker Kauder", "Ronald Pofalla"),
  function(who) subset(gparl, speaker == !!who)
)
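
# The Arguments section mentions bquote() as a way to supply an unevaluated
# expression. What bquote() does to a variable can be checked in base R
# without a corpus; the toy data frame below is an assumption for this sketch.

```r
who <- "Volker Kauder"

# quote() keeps the symbol `who`; bquote() splices in its current value.
quoted  <- quote(speaker == who)
spliced <- bquote(speaker == .(who))

deparse(quoted)   # "speaker == who"
deparse(spliced)  # "speaker == \"Volker Kauder\""

# The spliced expression no longer depends on the variable `who` and can be
# evaluated against any table that has a `speaker` column.
toy <- data.frame(
  speaker = c("Angela Dorothea Merkel", "Volker Kauder"),
  stringsAsFactors = FALSE
)
keep <- eval(spliced, toy)
toy[keep, , drop = FALSE]
```

# An expression built this way could then be passed as the subset argument;
# whether every method accepts it in this form is not verified here.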

# subset a subcorpus_bundle
merkel <- corpus("GERMAPARLMINI") %>%
  split(s_attribute = "protocol_date") %>%
  subset(speaker == "Angela Dorothea Merkel")

# iterate over objects in bundle one by one 
sp <- corpus("GERMAPARLMINI") %>%
  as.speeches(
    s_attribute_name = "speaker",
    s_attribute_date = "protocol_date",
    progress = FALSE
  ) %>%
  subset(interjection == "speech", iterate = TRUE, progress = FALSE)

[Package polmineR version 0.8.9]