tokens_split {quanteda} | R Documentation |
Split tokens by a separator pattern
Description
Replaces tokens by multiple replacements consisting of elements split by a
separator pattern, with the option of retaining the separator. This function
effectively reverses the operation of tokens_compound().
Usage
tokens_split(
  x,
  separator = " ",
  valuetype = c("fixed", "regex"),
  remove_separator = TRUE,
  apply_if = NULL
)
Arguments
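x | a tokens object |
separator | a single-character pattern match by which tokens are separated |
valuetype | the type of pattern matching: "fixed" for exact matching or "regex" for regular expressions |
remove_separator | if TRUE, remove the separator from the new tokens |
apply_if | logical vector of length ndoc(x); documents are modified only when the corresponding values are TRUE, others are left unchanged |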
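As a minimal sketch of how apply_if restricts the split to selected documents: the document names d1 and d2 and the token values below are made up for illustration, and as.tokens() is used only so that the input tokens are explicit rather than dependent on the tokenizer.

```r
library(quanteda)

# hypothetical two-document tokens object built directly from a list
toks <- as.tokens(list(
  d1 = c("UK-EU", "talks"),
  d2 = c("state-of-the-art", "methods")
))

# one logical value per document: split d1 on "-", leave d2 unchanged
res <- tokens_split(toks, separator = "-", apply_if = c(TRUE, FALSE))
as.list(res)
```

Because apply_if is matched to documents by position, its length should equal ndoc(toks), which is 2 here.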
Examples
# undo tokens_compound()
toks1 <- tokens("pork barrel is an idiomatic multi-word expression")
tokens_compound(toks1, phrase("pork barrel"))
tokens_compound(toks1, phrase("pork barrel")) |>
  tokens_split(separator = "_")
# similar to tokens(x, split_hyphens = TRUE) but post-tokenization
toks2 <- tokens("UK-EU negotiation is not going anywhere as of 2018-12-24.")
tokens_split(toks2, separator = "-", remove_separator = FALSE)
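The examples above use the default fixed matching. The following sketch illustrates valuetype = "regex" with a character class that matches either of two separators, with and without remove_separator. The object names toks3, split_drop and split_keep and the token values are assumptions chosen for illustration only.

```r
library(quanteda)

# hypothetical tokens containing two different separators
toks3 <- as.tokens(list(d1 = c("a-b", "c_d", "e")))

# split on either "-" or "_" and drop the separator
split_drop <- tokens_split(toks3, separator = "[-_]", valuetype = "regex")
as.list(split_drop)

# same split, but keep each separator as a token of its own
split_keep <- tokens_split(toks3, separator = "[-_]", valuetype = "regex",
                           remove_separator = FALSE)
as.list(split_keep)
```

Tokens that do not match the pattern, such as "e" here, pass through unchanged.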
[Package quanteda version 4.0.2 Index]