R: Prepare Text for Tokenization

prepare_text {piecemaker}

R Documentation

Prepare Text for Tokenization

Description

This function combines the other functions in this package to prepare text for tokenization. The text is first converted to valid UTF-8 (if possible), and then the cleaning steps selected by the arguments are applied.

Usage

prepare_text(
  text,
  squish_whitespace = TRUE,
  remove_terminal_hyphens = TRUE,
  remove_control_characters = TRUE,
  remove_replacement_characters = TRUE,
  remove_diacritics = TRUE,
  space_cjk = TRUE,
  space_punctuation = TRUE,
  space_hyphens = TRUE,
  space_abbreviations = TRUE
)

Arguments

`text`	A character vector to clean.
`squish_whitespace`	Logical scalar; squish whitespace characters (using `stringr::str_squish`)?
`remove_terminal_hyphens`	Logical scalar; remove hyphens that follow a word at the end of a line? For example, "un-\nbroken" would become "unbroken".
`remove_control_characters`	Logical scalar; remove control characters?
`remove_replacement_characters`	Logical scalar; remove the "replacement character", `U+FFFD`?
`remove_diacritics`	Logical scalar; remove diacritical marks (accents, etc.) from characters?
`space_cjk`	Logical scalar; add spaces around Chinese/Japanese/Korean ideographs?
`space_punctuation`	Logical scalar; add spaces around punctuation (to make it easier to keep punctuation during tokenization)?
`space_hyphens`	Logical scalar; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation.
`space_abbreviations`	Logical scalar; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation.
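
To make a few of the arguments above concrete, here is a minimal base-R sketch of what three of the steps (`remove_terminal_hyphens`, `remove_replacement_characters`, `squish_whitespace`) roughly do. The `sketch_*` function names and the regular expressions are assumptions made for illustration; this is not piecemaker's implementation, and the package may handle edge cases differently.

```r
# Base-R approximations only; the sketch_* names are hypothetical and
# are not part of piecemaker.

# Join a word split by a hyphen at a line break: "un-\nbroken" -> "unbroken".
sketch_remove_terminal_hyphens <- function(text) {
  gsub("([[:alpha:]])-[ \t]*\n[ \t]*([[:alpha:]])", "\\1\\2", text)
}

# Drop the Unicode replacement character, U+FFFD.
sketch_remove_replacement <- function(text) {
  gsub(intToUtf8(0xFFFD), "", text, fixed = TRUE)
}

# Collapse runs of whitespace to one space and trim both ends.
sketch_squish <- function(text) {
  trimws(gsub("[[:space:]]+", " ", text))
}

sample_text <- paste0("  un-\nbroken   text ", intToUtf8(0xFFFD), "\n")
cleaned <- sketch_squish(
  sketch_remove_replacement(
    sketch_remove_terminal_hyphens(sample_text)
  )
)
print(cleaned)
# [1] "unbroken text"
```

In this sketch the whitespace squish runs last, so any gaps left behind by the earlier removals are collapsed as well.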

Value

The character vector, cleaned as specified.

Examples

piece1 <- " This is a    \n\nfa\xE7ile\n\n    example.\n"
# Specify encoding so this example behaves the same on all systems.
Encoding(piece1) <- "latin1"
example_text <- paste(
  piece1,
  "It has the bell character, \a, and the replacement character,",
  intToUtf8(65533)
)
prepare_text(example_text)
prepare_text(example_text, squish_whitespace = FALSE)
prepare_text(example_text, remove_control_characters = FALSE)
prepare_text(example_text, remove_replacement_characters = FALSE)
prepare_text(example_text, remove_diacritics = FALSE)
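
The calls above do not exercise the hyphen and apostrophe arguments. A further sketch, assuming piecemaker is installed; the sample string `extra_text` is made up for illustration, and no output is shown because it was not verified here:

```r
# Hypothetical sample text with a terminal hyphen ("exam-\nple"), an
# in-word hyphen ("well-known"), and an apostrophe between letters ("don't").
extra_text <- "A well-known exam-\nple: don't panic."

if (requireNamespace("piecemaker", quietly = TRUE)) {
  # All cleaning steps at their defaults.
  print(piecemaker::prepare_text(extra_text))
  # Keep the hyphen at the line break.
  print(piecemaker::prepare_text(extra_text, remove_terminal_hyphens = FALSE))
  # Do not treat in-word hyphens and apostrophes as punctuation.
  print(piecemaker::prepare_text(
    extra_text,
    space_hyphens = FALSE,
    space_abbreviations = FALSE
  ))
}
```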

[Package piecemaker version 1.0.2 Index]