pre_tokenizer_byte_level {tok}    R Documentation

Byte-level pre-tokenizer

Description

Byte-level pre-tokenizer

Details

This pre-tokenizer replaces every byte of the input string with a corresponding character representation, and also splits the input into words.
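
The typical use is to attach this pre-tokenizer to a tokenizer before training or encoding. The sketch below is illustrative only: it assumes the tok tokenizer object exposes a pre_tokenizer field and a model_bpe model, mirroring the upstream Hugging Face tokenizers API; check the package reference if these names differ.

library(tok)

# Build a tokenizer around an (untrained) BPE model and attach the
# byte-level pre-tokenizer, so raw text is mapped to byte
# representations and split into words before reaching the model.
# Assumes tokenizer$new() accepts a model and that pre_tokenizer is
# an assignable field, as in the upstream tokenizers library.
tk <- tokenizer$new(model_bpe$new())
tk$pre_tokenizer <- pre_tokenizer_byte_level$new()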

Super class

tok::tok_pre_tokenizer -> tok_pre_tokenizer_byte_level

Methods

Public methods


Method new()

Initializes the byte-level pre-tokenizer

Usage
pre_tokenizer_byte_level$new(add_prefix_space = TRUE, use_regex = TRUE)
Arguments
add_prefix_space

Whether to add a space to the first word if there isn't already one

use_regex

Set this to FALSE to prevent this pre-tokenizer from using the GPT-2 specific regexp for splitting on whitespace.

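For illustration, the constructor can be called with both options disabled, using only the arguments documented above:

p <- pre_tokenizer_byte_level$new(
  add_prefix_space = FALSE,  # do not prepend a space to the first word
  use_regex = FALSE          # skip the GPT-2 whitespace-splitting regexp
)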

Method clone()

The objects of this class are cloneable with this method.

Usage
pre_tokenizer_byte_level$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

See Also

Other pre_tokenizer: pre_tokenizer, pre_tokenizer_whitespace


[Package tok version 0.1.3 Index]