trainer_wordpiece {tok}    R Documentation

WordPiece tokenizer trainer

Description

WordPiece tokenizer trainer

Super class

tok::tok_trainer -> tok_trainer_wordpiece

Methods

Public methods

trainer_wordpiece$new()
trainer_wordpiece$clone()

Method new()

Constructor for the WordPiece tokenizer trainer. A short usage sketch follows the argument descriptions below.

Usage
trainer_wordpiece$new(
  vocab_size = 30000,
  min_frequency = 0,
  show_progress = FALSE,
  special_tokens = NULL,
  limit_alphabet = NULL,
  initial_alphabet = NULL,
  continuing_subword_prefix = "##",
  end_of_word_suffix = NULL
)
Arguments
vocab_size

The size of the final vocabulary, including all tokens and the alphabet. Default: 30000.

min_frequency

The minimum frequency a pair must have in order to be merged. Default: 0.

show_progress

Whether to show progress bars while training. Default: FALSE.

special_tokens

A list of special tokens the model should be aware of. Default: NULL.

limit_alphabet

The maximum number of different characters to keep in the alphabet. Default: NULL.

initial_alphabet

A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept. Default: NULL.

continuing_subword_prefix

A prefix to be used for every subword that is not a beginning-of-word. Default: "##".

end_of_word_suffix

A suffix to be used for every subword that is an end-of-word. Default: NULL.
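
The example below is a minimal usage sketch, not taken from the package documentation: it assumes a character vector is accepted for special_tokens, and that the resulting trainer is later passed to a tokenizer's training method, whose exact interface is not covered on this page. All argument values are illustrative.

library(tok)

# Create a WordPiece trainer with a smaller vocabulary, a minimum pair
# frequency of 2, and BERT-style special tokens (illustrative values).
trainer <- trainer_wordpiece$new(
  vocab_size = 10000,
  min_frequency = 2,
  show_progress = TRUE,
  special_tokens = c("[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"),
  continuing_subword_prefix = "##"
)

# The trainer would then be supplied, together with the training corpus,
# to a tokenizer's training routine.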


Method clone()

The objects of this class are cloneable with this method.

Usage
trainer_wordpiece$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

See Also

Other trainer: tok_trainer, trainer_bpe, trainer_unigram


[Package tok version 0.1.3]