prepare_and_tokenize {piecemaker}    R Documentation
Split Text on Spaces
Description
This is a very simple tokenizer that splits text on spaces. It can
optionally apply the cleaning processes from prepare_text first.
Usage
prepare_and_tokenize(text, prepare = TRUE, ...)
Arguments
text: A character vector to clean.

prepare: Logical; should the text be passed through prepare_text?

...: Arguments passed on to prepare_text (a sketch follows this list):

    squish_whitespace: Logical scalar; squish whitespace characters
    (using str_squish)?

    remove_control_characters: Logical scalar; remove control characters?

    remove_replacement_characters: Logical scalar; remove the "replacement
    character", U+FFFD?

    remove_diacritics: Logical scalar; remove diacritical marks (accents,
    etc.) from characters?

    space_cjk: Logical scalar; add spaces around Chinese/Japanese/Korean
    ideographs?

    space_punctuation: Logical scalar; add spaces around punctuation (to
    make it easier to keep punctuation during tokenization)?

    remove_terminal_hyphens: Logical; should hyphens at the end of lines
    after a word be removed? For example, "un-\nbroken" would become
    "unbroken".

    space_hyphens: Logical; treat hyphens between letters and at the
    start/end of words as punctuation? Other hyphens are always treated as
    punctuation.

    space_abbreviations: Logical; treat apostrophes between letters as
    punctuation? Other apostrophes are always treated as punctuation.
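As a quick sketch of the prepare toggle (the commented tokens assume
prepare_text's default cleaning, including space_punctuation = TRUE):

prepare_and_tokenize("Hello, world.")
# roughly: list(c("Hello", ",", "world", "."))
prepare_and_tokenize("Hello, world.", prepare = FALSE)
# roughly: list(c("Hello,", "world."))  # split on spaces only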
Value
The text as a list of character vectors. Each element of each vector
is roughly equivalent to a word.
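For instance, a two-element input should yield a two-element list of token
vectors (a sketch; the exact tokens depend on the cleaning defaults):

tokens <- prepare_and_tokenize(c("first sentence.", "second one"))
str(tokens)
# Expected shape, roughly:
# List of 2
#  $ : chr [1:3] "first" "sentence" "."
#  $ : chr [1:2] "second" "one"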
Examples
prepare_and_tokenize("This is some text.")
prepare_and_tokenize("This is some text.", space_punctuation = FALSE)
[Package piecemaker version 1.0.2 Index]