pre_tokenizer_byte_level {tok} | R Documentation
Byte-level pre-tokenizer
Description
Byte-level pre-tokenizer
Details
This pre-tokenizer replaces every byte of the input string with a corresponding representation and splits the input into words.
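A minimal sketch of how this pre-tokenizer might be used. Only the pre_tokenizer_byte_level$new() constructor shown on this page is documented here; tokenizer$new(), model_bpe$new(), and assignment to a tokenizer's pre_tokenizer field are assumptions modelled on the upstream Hugging Face tokenizers API.

library(tok)

# construct a byte-level pre-tokenizer with the documented defaults
pre_tok <- pre_tokenizer_byte_level$new(add_prefix_space = TRUE, use_regex = TRUE)

# assumed usage: attach it to a tokenizer before training or encoding
# (tokenizer, model_bpe, and the pre_tokenizer field mirror the upstream
# tokenizers API and are not documented on this page)
tk <- tokenizer$new(model_bpe$new())
tk$pre_tokenizer <- pre_tok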
Super class
tok::tok_pre_tokenizer
-> tok_pre_tokenizer_byte_level
Methods
Public methods
Method new()
Initializes the byte-level pre-tokenizer
Usage
pre_tokenizer_byte_level$new(add_prefix_space = TRUE, use_regex = TRUE)
Arguments
add_prefix_space
Whether to add a space to the first word if there isn't one already
use_regex
Set this to FALSE to prevent this pre-tokenizer from using the GPT-2-specific regex for splitting on whitespace.
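For illustration, a sketch of constructing the pre-tokenizer with non-default options; the argument names are taken from the Usage above.

# keep the input as-is (no leading space) and skip the GPT-2 splitting regex
pre_tok <- pre_tokenizer_byte_level$new(
  add_prefix_space = FALSE,
  use_regex = FALSE
)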
Method clone()
The objects of this class are cloneable with this method.
Usage
pre_tokenizer_byte_level$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
See Also
Other pre_tokenizer:
pre_tokenizer, pre_tokenizer_whitespace
[Package tok version 0.1.3 Index]