| trainer_unigram {tok} | R Documentation |
Unigram tokenizer trainer
Description
Unigram tokenizer trainer
Super class
tok::tok_trainer -> tok_trainer_unigram
Methods
Public methods
Method new()
Constructor for the Unigram tokenizer trainer.
Usage
trainer_unigram$new(
  vocab_size = 8000,
  show_progress = TRUE,
  special_tokens = NULL,
  shrinking_factor = 0.75,
  unk_token = NULL,
  max_piece_length = 16,
  n_sub_iterations = 2
)
Arguments
vocab_size: The size of the final vocabulary, including all tokens and alphabet.
show_progress: Whether to show progress bars while training.
special_tokens: A list of special tokens the model should be aware of.
shrinking_factor: The shrinking factor used at each step of training to prune the vocabulary.
unk_token: The token used for out-of-vocabulary tokens.
max_piece_length: The maximum length of a given token.
n_sub_iterations: The number of iterations of the EM algorithm to perform before pruning the vocabulary.
initial_alphabet: A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
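As an illustration of the arguments above, the following is a minimal sketch of constructing a trainer. The vocabulary size and token strings are illustrative values, and passing special_tokens as a character vector is an assumption about the accepted input type, not something this page confirms.

```r
library(tok)

# Illustrative values only; choose vocab_size to suit the training corpus.
trainer <- trainer_unigram$new(
  vocab_size = 1000,
  show_progress = FALSE,
  special_tokens = c("<unk>", "<pad>"),  # assumption: a character vector is accepted
  unk_token = "<unk>",
  shrinking_factor = 0.75,               # documented default, shown for clarity
  n_sub_iterations = 2                   # documented default, shown for clarity
)

# The class names come from the "Super class" section of this page.
inherits(trainer, "tok_trainer_unigram")
inherits(trainer, "tok_trainer")
```

The resulting object would then be handed to a tokenizer's training routine, which is documented separately from this page.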
Method clone()
The objects of this class are cloneable with this method.
Usage
trainer_unigram$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
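A short sketch of cloning a trainer, using only the signatures shown on this page. The variable names are hypothetical, and whether the copy shares any underlying state with the original is not stated here, so treat the two objects as potentially linked.

```r
library(tok)

# Build a trainer with the documented defaults, then copy it.
original <- trainer_unigram$new(show_progress = FALSE)
copy <- original$clone(deep = FALSE)

# The copy is a separate R6 object of the same class.
inherits(copy, "tok_trainer_unigram")
```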
See Also
Other trainer: tok_trainer, trainer_bpe, trainer_wordpiece