pre_tokenizer_byte_level {tok}    R Documentation

Byte-level pre-tokenizer

Description

Byte-level pre-tokenizer

Details

This pre-tokenizer replaces every byte of the input string with a corresponding character representation, and also splits the input into words.
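
The typical use is to attach this pre-tokenizer to a tokenizer before training or encoding. The sketch below is illustrative only: it assumes the tok tokenizer object exposes a pre_tokenizer field and a model_bpe model, mirroring the upstream Hugging Face tokenizers API; check the package reference if these names differ.

library(tok)

# Build a tokenizer around an (untrained) BPE model and attach the
# byte-level pre-tokenizer, so raw text is mapped to byte
# representations and split into words before reaching the model.
# Assumes tokenizer$new() accepts a model and that pre_tokenizer is
# an assignable field, as in the upstream tokenizers library.
tk <- tokenizer$new(model_bpe$new())
tk$pre_tokenizer <- pre_tokenizer_byte_level$new()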

Super class

tok::tok_pre_tokenizer -> tok_pre_tokenizer_byte_level

Methods

Public methods


Method new()

Initializes the byte-level pre-tokenizer

Usage
pre_tokenizer_byte_level$new(add_prefix_space = TRUE, use_regex = TRUE)
Arguments
add_prefix_space

Whether to add a space to the first word if there isn't already one

use_regex

Set this to FALSE to prevent this pre-tokenizer from using the GPT-2 specific regexp for splitting on whitespace.

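For illustration, the constructor can be called with both options disabled, using only the arguments documented above:

p <- pre_tokenizer_byte_level$new(
  add_prefix_space = FALSE,  # do not prepend a space to the first word
  use_regex = FALSE          # skip the GPT-2 whitespace-splitting regexp
)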

Method clone()

The objects of this class are cloneable with this method.

Usage
pre_tokenizer_byte_level$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

See Also

Other pre_tokenizer: pre_tokenizer, pre_tokenizer_whitespace


[Package tok version 0.1.3 Index]