trainer_unigram {tok}                R Documentation

Unigram tokenizer trainer

Description

Unigram tokenizer trainer

Super class

tok::tok_trainer -> tok_trainer_unigram

Methods

Public methods


Method new()

Constructor for the Unigram tokenizer trainer.

Usage
trainer_unigram$new(
  vocab_size = 8000,
  show_progress = TRUE,
  special_tokens = NULL,
  shrinking_factor = 0.75,
  unk_token = NULL,
  max_piece_length = 16,
  n_sub_iterations = 2
)
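
A call that overrides a few defaults might look like the following minimal sketch. It assumes the tok package is installed and attached. The value 1000 and the token strings "<unk>" and "<pad>" are illustrative choices, and passing special_tokens as a character vector is an assumption rather than something this page states.

```r
library(tok)

# Illustrative values only; none of these are recommended settings.
trainer <- trainer_unigram$new(
  vocab_size = 1000,
  show_progress = FALSE,
  special_tokens = c("<unk>", "<pad>"),  # assumed: a character vector is accepted
  unk_token = "<unk>"
)

# The result is an R6 object of class tok_trainer_unigram,
# which extends tok_trainer (see "Super class" above).
is_unigram_trainer <- inherits(trainer, "tok_trainer_unigram")
```

Arguments left out of the call keep the defaults shown in the usage above.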
Arguments
vocab_size

The size of the final vocabulary, including all tokens and the alphabet.

show_progress

Whether to show progress bars while training.

special_tokens

A list of special tokens the model should be aware of.

shrinking_factor

The shrinking factor used at each step of training to prune the vocabulary.

unk_token

The token used to represent out-of-vocabulary input.

max_piece_length

The maximum length of a given token.

n_sub_iterations

The number of iterations of the EM algorithm to perform before pruning the vocabulary.

initial_alphabet

A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
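
To give a feel for how vocab_size and shrinking_factor interact, the sketch below models pruning in plain R, without calling tok. It is a simplified model under stated assumptions, not the trainer's actual code: each pruning round is assumed to keep the shrinking_factor fraction of the current pieces and never drop below vocab_size. The function name pruning_schedule and the seed size of 20000 candidate pieces are hypothetical.

```r
# Simplified model of vocabulary pruning (an assumption, not the trainer's
# implementation): each round keeps `shrinking_factor` of the current pieces,
# but never fewer than `vocab_size`.
pruning_schedule <- function(n_seed, vocab_size = 8000, shrinking_factor = 0.75) {
  stopifnot(shrinking_factor > 0, shrinking_factor < 1, n_seed >= vocab_size)
  sizes <- n_seed
  current <- n_seed
  while (current > vocab_size) {
    # Shrink, then clamp so the last round lands exactly on the target size.
    current <- max(vocab_size, floor(current * shrinking_factor))
    sizes <- c(sizes, current)
  }
  sizes
}

# Hypothetical seed vocabulary of 20000 candidate pieces.
schedule <- pruning_schedule(20000)
schedule
# 20000 15000 11250 8437 8000

# Number of pruning rounds under this model.
n_rounds <- length(schedule) - 1
```

Under this model the defaults need four pruning rounds to go from 20000 pieces to 8000. Since n_sub_iterations EM iterations run before each pruning step, a smaller shrinking_factor means fewer rounds and less EM work overall, at the cost of coarser pruning.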


Method clone()

The objects of this class are cloneable with this method.

Usage
trainer_unigram$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.
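
A short sketch of cloning, assuming the tok package is installed and attached. The vocab_size value is illustrative.

```r
library(tok)

trainer <- trainer_unigram$new(vocab_size = 1000, show_progress = FALSE)

# Shallow copy, the default (deep = FALSE).
trainer_copy <- trainer$clone()

# The copy has the same class as the original.
same_class <- identical(class(trainer_copy), class(trainer))
```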

See Also

Other trainer: tok_trainer, trainer_bpe, trainer_wordpiece


[Package tok version 0.1.3 Index]