prepare_and_tokenize | Split Text on Spaces |
prepare_text | Prepare Text for Tokenization |
remove_control_characters | Remove Non-Character Characters |
remove_diacritics | Remove Diacritical Marks on Characters |
remove_replacement_characters | Remove the Unicode Replacement Character |
space_cjk | Add Spaces Around CJK Ideographs |
space_punctuation | Add Spaces Around Punctuation |
squish_whitespace | Remove Extra Whitespace |
tokenize_space | Break Text at Spaces |
validate_utf8 | Clean Up Text to UTF-8 |
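
The index gives only names and one-line titles, so the following is a minimal Python sketch of what a few of these steps could look like when chained together. It is not the package's implementation: the function bodies, the order of the steps, and the helper `prepare_and_tokenize_sketch` are assumptions made for illustration, with names borrowed from the index entries above.

```python
import re
import unicodedata


def remove_replacement_characters(text):
    # Drop U+FFFD, the Unicode replacement character.
    return text.replace("\ufffd", "")


def remove_diacritics(text):
    # Decompose to NFD, then drop combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def space_punctuation(text):
    # Surround every character in a Unicode punctuation category with spaces.
    return "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


def squish_whitespace(text):
    # Collapse whitespace runs to a single space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def tokenize_space(text):
    # Break at single spaces; an empty string yields no tokens.
    return text.split(" ") if text else []


def prepare_and_tokenize_sketch(text):
    # Hypothetical pipeline; the step order is an assumption, not documented behavior.
    text = remove_replacement_characters(text)
    text = remove_diacritics(text)
    text = space_punctuation(text)
    text = squish_whitespace(text)
    return tokenize_space(text)


print(prepare_and_tokenize_sketch("Café,  déjà\ufffd vu!"))
# ['Cafe', ',', 'deja', 'vu', '!']
```

Squishing whitespace after spacing the punctuation keeps the final split from producing empty tokens, which is why the sketch orders the steps this way.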