trainer_unigram {tok}    R Documentation
Unigram tokenizer trainer
Description
Unigram tokenizer trainer
Super class
tok::tok_trainer
-> tok_trainer_unigram
Methods
Public methods
Method new()
Constructor for the Unigram tokenizer trainer.
Usage
trainer_unigram$new(
  vocab_size = 8000,
  show_progress = TRUE,
  special_tokens = NULL,
  shrinking_factor = 0.75,
  unk_token = NULL,
  max_piece_length = 16,
  n_sub_iterations = 2
)
Arguments
vocab_size
The size of the final vocabulary, including all tokens and the alphabet.
show_progress
Whether to show progress bars while training.
special_tokens
A list of special tokens the model should be aware of.
shrinking_factor
The shrinking factor used at each step of training to prune the vocabulary.
unk_token
The token used to represent out-of-vocabulary input.
max_piece_length
The maximum length of a given token.
n_sub_iterations
The number of iterations of the EM algorithm to perform before pruning the vocabulary.
initial_alphabet
A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
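Examples
The following is a minimal sketch of constructing a trainer with non-default settings, not an excerpt from the package's own examples. It assumes the tok package is installed and that special_tokens accepts a character vector; the token strings "<unk>" and "<pad>" are illustrative choices, not values required by the package.

```r
library(tok)

# Build a Unigram trainer with a smaller target vocabulary.
# Assumption: special_tokens can be given as a character vector.
trainer <- trainer_unigram$new(
  vocab_size = 1000,                      # target size of the final vocabulary
  show_progress = FALSE,                  # suppress progress bars
  special_tokens = c("<unk>", "<pad>"),   # illustrative special tokens
  unk_token = "<unk>",                    # token for out-of-vocabulary input
  shrinking_factor = 0.75,                # default pruning factor, shown explicitly
  max_piece_length = 16,                  # default maximum token length
  n_sub_iterations = 2                    # default EM iterations before pruning
)

# The object carries the class name listed under "Super class" above.
class(trainer)
```

Listing unk_token among special_tokens as well is a choice made here so that the unknown token is reserved in the vocabulary. The resulting trainer is meant to be passed to a tokenizer's training routine; see the tokenizer documentation of your installed tok version for the available training methods.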
Method clone()
The objects of this class are cloneable with this method.
Usage
trainer_unigram$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
See Also
Other trainer: tok_trainer, trainer_bpe, trainer_wordpiece