Tools for Preparing Text for Tokenizers



Documentation for package ‘piecemaker’ version 1.0.2

Help Pages

prepare_and_tokenize           Split Text on Spaces
prepare_text                   Prepare Text for Tokenization
remove_control_characters      Remove Non-Character Characters
remove_diacritics              Remove Diacritical Marks on Characters
remove_replacement_characters  Remove the Unicode Replacement Character
space_cjk                      Add Spaces Around CJK Ideographs
space_punctuation              Add Spaces Around Punctuation
squish_whitespace              Remove Extra Whitespace
tokenize_space                 Break Text at Spaces
validate_utf8                  Clean Up Text to UTF-8