char_select {quanteda}R Documentation

Select or remove elements from a character vector

Description

These function select or discard elements from a character object. For convenience, the functions char_remove and char_keep are defined as shortcuts for char_select(x, pattern, selection = "remove") and char_select(x, pattern, selection = "keep"), respectively.

These functions make it easy to change, for instance, stopwords based on pattern matching.

Usage

char_select(
  x,
  pattern,
  selection = c("keep", "remove"),
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE
)

char_remove(x, ...)

char_keep(x, ...)

Arguments

x

an input character vector

pattern

a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

selection

whether to "keep" or "remove" the tokens matching pattern

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

...

additional arguments passed by char_remove and char_keep to char_select. Cannot include selection.

Value

a modified character vector

Examples

# character selection
mykeywords <- c("natural", "national", "denatured", "other")
char_select(mykeywords, "nat*", valuetype = "glob")
char_select(mykeywords, "nat", valuetype = "regex")
char_select(mykeywords, c("natur*", "other"))
char_select(mykeywords, c("natur*", "other"), selection = "remove")

# character removal
char_remove(letters[1:5], c("a", "c", "x"))
words <- c("any", "and", "Anna", "as", "announce", "but")
char_remove(words, "an*")
char_remove(words, "an*", case_insensitive = FALSE)
char_remove(words, "^.n.+$", valuetype = "regex")

# remove some of the system stopwords
stopwords("en", source = "snowball")[1:6]
stopwords("en", source = "snowball")[1:6] |>
  char_remove(c("me", "my*"))
  
# character keep
char_keep(letters[1:5], c("a", "c", "x"))

[Package quanteda version 4.0.2 Index]