prepare_and_tokenize {piecemaker}    R Documentation
Split Text on Spaces
Description
This is a very simple tokenizer that splits text on spaces. It can
optionally apply the cleaning processes from prepare_text first.
Usage
prepare_and_tokenize(text, prepare = TRUE, ...)
Arguments
text: A character vector to clean.

prepare: Logical; should the text be passed through prepare_text?

...: Arguments passed on to prepare_text (a sketch follows this list):

    squish_whitespace: Logical scalar; squish whitespace characters
    (using str_squish)?

    remove_control_characters: Logical scalar; remove control characters?

    remove_replacement_characters: Logical scalar; remove the "replacement
    character", U+FFFD?

    remove_diacritics: Logical scalar; remove diacritical marks (accents,
    etc.) from characters?

    space_cjk: Logical scalar; add spaces around Chinese/Japanese/Korean
    ideographs?

    space_punctuation: Logical scalar; add spaces around punctuation (to
    make it easier to keep punctuation during tokenization)?

    remove_terminal_hyphens: Logical; should hyphens at the end of lines
    after a word be removed? For example, "un-\nbroken" would become
    "unbroken".

    space_hyphens: Logical; treat hyphens between letters and at the
    start/end of words as punctuation? Other hyphens are always treated as
    punctuation.

    space_abbreviations: Logical; treat apostrophes between letters as
    punctuation? Other apostrophes are always treated as punctuation.
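As a quick sketch of the prepare toggle (the commented tokens assume
prepare_text's default cleaning, including space_punctuation = TRUE):

prepare_and_tokenize("Hello, world.")
# roughly: list(c("Hello", ",", "world", "."))
prepare_and_tokenize("Hello, world.", prepare = FALSE)
# roughly: list(c("Hello,", "world."))  # split on spaces only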
Value
The text as a list of character vectors. Each element of each vector
is roughly equivalent to a word.
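For instance, a two-element input should yield a two-element list of token
vectors (a sketch; the exact tokens depend on the cleaning defaults):

tokens <- prepare_and_tokenize(c("first sentence.", "second one"))
str(tokens)
# Expected shape, roughly:
# List of 2
#  $ : chr [1:3] "first" "sentence" "."
#  $ : chr [1:2] "second" "one"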
Examples
prepare_and_tokenize("This is some text.")
prepare_and_tokenize("This is some text.", space_punctuation = FALSE)
[Package piecemaker version 1.0.2 Index]