trainer_unigram {tok}

R Documentation

Unigram tokenizer trainer

Description

Unigram tokenizer trainer

Super class

tok::tok_trainer -> tok_trainer_unigram

Methods

Public methods

trainer_unigram$new()
trainer_unigram$clone()

Method `new()`

Constructor for the Unigram tokenizer trainer.

Usage

trainer_unigram$new(
  vocab_size = 8000,
  show_progress = TRUE,
  special_tokens = NULL,
  shrinking_factor = 0.75,
  unk_token = NULL,
  max_piece_length = 16,
  n_sub_iterations = 2
)

Arguments

vocab_size: The size of the final vocabulary, including all tokens and the alphabet.
show_progress: Whether to show progress bars while training.
special_tokens: A list of special tokens the model should be aware of.
shrinking_factor: The shrinking factor used at each step of training to prune the vocabulary.
unk_token: The token used to represent out-of-vocabulary input.
max_piece_length: The maximum length of a given token.
n_sub_iterations: The number of iterations of the EM algorithm to perform before pruning the vocabulary.
initial_alphabet: A list of characters to include in the initial alphabet, even if they are not seen in the training dataset. If a string contains more than one character, only the first character is kept.
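
Example

The call below is a minimal sketch of constructing a trainer with some of the arguments listed above. It assumes the tok package is installed; the argument values are illustrative choices, not recommendations, and the `requireNamespace()` guard is added only so the snippet does nothing when tok is unavailable.

```r
# Assumption: the tok package is installed. The guard makes this a no-op otherwise.
if (requireNamespace("tok", quietly = TRUE)) {
  trainer <- tok::trainer_unigram$new(
    vocab_size = 1000,        # target size of the final vocabulary (illustrative)
    show_progress = FALSE,    # suppress progress bars, e.g. in scripts
    shrinking_factor = 0.75,  # same as the documented default
    max_piece_length = 8,     # shorter than the documented default of 16
    n_sub_iterations = 2      # same as the documented default
  )
  print(class(trainer))
}
```

Arguments left out of the call keep the defaults shown under Usage.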

Method `clone()`

The objects of this class are cloneable with this method.

Usage

trainer_unigram$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
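
Example

A short sketch of cloning a trainer, assuming the tok package is installed. It only shows the call shape; whether `deep = TRUE` also duplicates the underlying native trainer state is not stated on this page, so the example makes no claim about that.

```r
# Assumption: the tok package is installed. The guard makes this a no-op otherwise.
if (requireNamespace("tok", quietly = TRUE)) {
  trainer <- tok::trainer_unigram$new(vocab_size = 1000, show_progress = FALSE)

  # Shallow clone (deep = FALSE is the documented default).
  trainer_copy <- trainer$clone()

  # The clone is a separate R object with the same class vector.
  same_class <- identical(class(trainer), class(trainer_copy))
  print(same_class)
}
```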
