| trainer_wordpiece {tok} | R Documentation |
WordPiece tokenizer trainer
Description
WordPiece tokenizer trainer
Super class
tok::tok_trainer -> tok_trainer_wordpiece
Methods
Public methods
Method new()
Constructor for the WordPiece tokenizer trainer
Usage
trainer_wordpiece$new(
  vocab_size = 30000,
  min_frequency = 0,
  show_progress = FALSE,
  special_tokens = NULL,
  limit_alphabet = NULL,
  initial_alphabet = NULL,
  continuing_subword_prefix = "##",
  end_of_word_suffix = NULL
)
Arguments
vocab_size: The size of the final vocabulary, including all tokens and alphabet. Default: 30000.
min_frequency: The minimum frequency a pair should have in order to be merged. Default: 0.
show_progress: Whether to show progress bars while training. Default: FALSE.
special_tokens: A list of special tokens the model should be aware of. Default: NULL.
limit_alphabet: The maximum number of different characters to keep in the alphabet. Default: NULL.
initial_alphabet: A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept. Default: NULL.
continuing_subword_prefix: A prefix to be used for every subword that is not a beginning-of-word. Default: "##".
end_of_word_suffix: A suffix to be used for every subword that is an end-of-word. Default: NULL.
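Examples
The constructor above might be used as in the following minimal sketch. The argument values and the toy corpus are illustrative only. The trainer arguments come from this page; `tokenizer`, `model_wordpiece`, `pre_tokenizer_whitespace` and `train_from_memory()` are assumed from other tok help pages and should be checked against the installed version. Passing `special_tokens` as a character vector is also an assumption.

```r
# Sketch only: requires the tok package to be installed.
library(tok)

# Build a WordPiece trainer. All argument names are taken from the
# Usage section of this page; the values are illustrative.
trainer <- trainer_wordpiece$new(
  vocab_size = 1000,
  min_frequency = 2,
  show_progress = FALSE,
  # Assumption: a character vector is accepted for special_tokens.
  special_tokens = c("[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"),
  continuing_subword_prefix = "##"
)

# Toy corpus, made up for this example.
texts <- rep(
  c("tokenizers split text into subwords",
    "wordpiece trainers learn a subword vocabulary"),
  times = 20
)

# Assumption: the calls below are not documented on this page.
# They are believed to be part of tok's tokenizer API.
tk <- tokenizer$new(model_wordpiece$new(unk_token = "[UNK]"))
tk$pre_tokenizer <- pre_tokenizer_whitespace$new()
tk$train_from_memory(texts, trainer)

# Inspect how a sentence is split after training.
tk$encode("wordpiece tokenizers")$tokens
```

With the default `continuing_subword_prefix`, pieces that continue a word would be expected to carry the `##` prefix, while word-initial pieces would not.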
Method clone()
The objects of this class are cloneable with this method.
Usage
trainer_wordpiece$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
Other trainer: tok_trainer, trainer_bpe, trainer_unigram