prepare_text {piecemaker} | R Documentation |
Prepare Text for Tokenization
Description
This function combines the other functions in this package to prepare text for tokenization. The text gets converted to valid UTF-8 (if possible), and then various cleaning functions are applied.
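As a rough illustration of the "converted to valid UTF-8" step, the sketch below uses only base R (`enc2utf8()` and `validUTF8()`). It is an assumption about what such a conversion looks like, not piecemaker's own code; the variable names are made up for this example.

```r
# Base-R sketch only; piecemaker's internal conversion may differ.
raw_piece <- "fa\xE7ile"            # byte 0xE7 is "c with cedilla" in Latin-1
Encoding(raw_piece) <- "latin1"     # declare how the bytes should be read

utf8_piece <- enc2utf8(raw_piece)   # re-encode the string as UTF-8

validUTF8(utf8_piece)               # is the result well-formed UTF-8?
Encoding(utf8_piece)                # declared encoding after conversion
nchar(utf8_piece)                   # number of characters, not bytes
```

Declaring the source encoding first matters: without it, R cannot know how to interpret the byte `0xE7`, and the conversion has nothing to work from.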
Usage
prepare_text(
text,
squish_whitespace = TRUE,
remove_terminal_hyphens = TRUE,
remove_control_characters = TRUE,
remove_replacement_characters = TRUE,
remove_diacritics = TRUE,
space_cjk = TRUE,
space_punctuation = TRUE,
space_hyphens = TRUE,
space_abbreviations = TRUE
)
Arguments
text |
A character vector to clean. |
squish_whitespace |
Logical scalar; squish whitespace characters (using
|
remove_terminal_hyphens |
Logical; should hyphens at the end of lines after a word be removed? For example, "un-\nbroken" would become "unbroken". |
remove_control_characters |
Logical scalar; remove control characters? |
remove_replacement_characters |
Logical scalar; remove the "replacement character" (U+FFFD)? |
remove_diacritics |
Logical scalar; remove diacritical marks (accents, etc.) from characters? |
space_cjk |
Logical scalar; add spaces around Chinese/Japanese/Korean ideographs? |
space_punctuation |
Logical scalar; add spaces around punctuation (to make it easier to keep punctuation during tokenization)? |
space_hyphens |
Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation. |
space_abbreviations |
Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation. |
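To make a few of these arguments concrete, here is a minimal base-R approximation of four of the cleaning steps (terminal hyphens, replacement character, control characters, whitespace). `approx_clean()` is a hypothetical helper written for this illustration; it is not part of piecemaker, and the package's actual rules may differ in detail.

```r
# Hypothetical helper, NOT part of piecemaker: a rough approximation of
# four of the cleaning steps described in the arguments above.
approx_clean <- function(text) {
  # Join a word split by a hyphen at the end of a line:
  # "un-\nbroken" becomes "unbroken".
  text <- gsub(
    "([[:alpha:]])-[ \t]*\r?\n[ \t]*([[:alpha:]])", "\\1\\2",
    text, perl = TRUE
  )
  # Drop the replacement character, U+FFFD.
  text <- gsub(intToUtf8(65533), "", text, fixed = TRUE)
  # Drop ASCII control characters, leaving tab/newline-style whitespace
  # (0x09 to 0x0D) for the whitespace step below.
  text <- gsub("[\\x01-\\x08\\x0E-\\x1F\\x7F]", "", text, perl = TRUE)
  # Collapse runs of whitespace to one space and trim both ends.
  text <- gsub("\\s+", " ", text, perl = TRUE)
  trimws(text)
}

approx_clean("un-\nbroken")
approx_clean("  a \n\n b\t")
approx_clean(paste0("x\a", intToUtf8(65533), "y"))
```

The order is deliberate in this sketch: hyphens are rejoined before whitespace is squished, because squishing first would turn the line break into a space and hide the end-of-line hyphen.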
Value
The character vector, cleaned as specified.
Examples
piece1 <- " This is a \n\nfa\xE7ile\n\n example.\n"
# Specify encoding so this example behaves the same on all systems.
Encoding(piece1) <- "latin1"
example_text <- paste(
piece1,
"It has the bell character, \a, and the replacement character,",
intToUtf8(65533)
)
prepare_text(example_text)
prepare_text(example_text, squish_whitespace = FALSE)
prepare_text(example_text, remove_control_characters = FALSE)
prepare_text(example_text, remove_replacement_characters = FALSE)
prepare_text(example_text, remove_diacritics = FALSE)
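The examples above leave several arguments at their defaults. The sketch below toggles the remaining ones, using only argument names from the Usage section. The sample string `more_text` is invented for this illustration, and no output is shown because it depends on the installed package version. The calls are guarded so the script does nothing when piecemaker is not installed.

```r
# Illustrative input (an assumption, not from the package): it contains a
# terminal hyphen, an in-word hyphen, an apostrophe, punctuation, and two
# CJK ideographs (written as escapes so the file stays ASCII).
more_text <- "The so-called un-\nbroken chain isn't \u4e2d\u6587, is it?"

results <- list()
if (requireNamespace("piecemaker", quietly = TRUE)) {
  results <- list(
    defaults = piecemaker::prepare_text(more_text),
    keep_terminal_hyphens =
      piecemaker::prepare_text(more_text, remove_terminal_hyphens = FALSE),
    no_cjk_spacing =
      piecemaker::prepare_text(more_text, space_cjk = FALSE),
    no_punctuation_spacing =
      piecemaker::prepare_text(more_text, space_punctuation = FALSE),
    keep_hyphenated_words =
      piecemaker::prepare_text(more_text, space_hyphens = FALSE),
    keep_contractions =
      piecemaker::prepare_text(more_text, space_abbreviations = FALSE)
  )
}
print(results)
```

Comparing each element against `results$defaults` is a quick way to see what a single argument changes.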