tokenizer {tok}    R Documentation

Tokenizer

Description

A tokenizer works as a pipeline: it takes raw text as input and outputs an encoding.

Value

A tokenizer that can be used for encoding character strings or decoding integers.

Public fields

.tokenizer

(unsafe usage) Lower-level pointer to the underlying tokenizer; not intended for direct use.

Active bindings

pre_tokenizer

Gets the pre-tokenizer instance

normalizer

Gets the normalizer instance

post_processor

Gets the post-processor used by the tokenizer

decoder

Gets and sets the decoder

padding

Gets padding configuration

truncation

Gets truncation configuration

Methods

Public methods


Method new()

Initializes a tokenizer

Usage
tokenizer$new(tokenizer)
Arguments
tokenizer

A tokenizer object that will be cloned to initialize the new tokenizer


Method encode()

Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.

Usage
tokenizer$encode(
  sequence,
  pair = NULL,
  is_pretokenized = FALSE,
  add_special_tokens = TRUE
)
Arguments
sequence

The main input sequence to encode. This sequence can be either raw text or pre-tokenized, depending on the is_pretokenized argument

pair

An optional input sequence. The expected format is the same as for sequence.

is_pretokenized

Whether the input is already pre-tokenized

add_special_tokens

Whether to add the special tokens
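
A minimal sketch of encoding, assuming the tok package is installed and the "gpt2" tokenizer can be downloaded as in the Examples section:

```r
library(tok)

tk <- tokenizer$from_pretrained("gpt2")

# Encode a single raw-text sequence; the result exposes the token ids
enc <- tk$encode("Hello world")
enc$ids

# Encode a sequence pair, without adding special tokens
enc_pair <- tk$encode("Hello", pair = "world", add_special_tokens = FALSE)
```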


Method decode()

Decode the given list of ids back to a string

Usage
tokenizer$decode(ids, skip_special_tokens = TRUE)
Arguments
ids

The list of ids that we want to decode

skip_special_tokens

Whether the special tokens should be removed from the decoded string
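
For instance, decoding the ids produced by encode() recovers the original text (a sketch, assuming the "gpt2" tokenizer from the Examples section):

```r
library(tok)

tk <- tokenizer$from_pretrained("gpt2")
ids <- tk$encode("Hello world")$ids
tk$decode(ids)  # should round-trip back to the original text
```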


Method encode_batch()

Encodes a batch of sequences. Returns a list of encodings.

Usage
tokenizer$encode_batch(
  input,
  is_pretokenized = FALSE,
  add_special_tokens = TRUE
)
Arguments
input

A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, depending on the is_pretokenized argument.

is_pretokenized

Whether the input is already pre-tokenized

add_special_tokens

Whether to add the special tokens
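
A sketch of batch encoding, assuming the "gpt2" tokenizer from the Examples section:

```r
library(tok)

tk <- tokenizer$from_pretrained("gpt2")

# Each element of the input list is one sequence to encode
encs <- tk$encode_batch(list("Hello world", "Goodbye world"))
encs[[1]]$ids
```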


Method decode_batch()

Decode a batch of ids back to their corresponding strings

Usage
tokenizer$decode_batch(sequences, skip_special_tokens = TRUE)
Arguments
sequences

The batch of sequences we want to decode

skip_special_tokens

Whether the special tokens should be removed from the decoded strings
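
A sketch of batch decoding, assuming the "gpt2" tokenizer from the Examples section:

```r
library(tok)

tk <- tokenizer$from_pretrained("gpt2")
encs <- tk$encode_batch(list("Hello world", "Goodbye world"))

# decode_batch() takes a list of id vectors, one per sequence
tk$decode_batch(lapply(encs, function(e) e$ids))
```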


Method from_file()

Creates a tokenizer from the path of a serialized tokenizer. This is a static method and should be called instead of $new when initializing the tokenizer.

Usage
tokenizer$from_file(path)
Arguments
path

Path to tokenizer.json file
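
A sketch, assuming a tokenizer.json file already exists on disk (for example, one previously written with $save()); the path below is a placeholder:

```r
library(tok)

# Point this at a real serialized tokenizer file
tk <- tokenizer$from_file("tokenizer.json")
```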


Method from_pretrained()

Instantiate a new Tokenizer from an existing file on the Hugging Face Hub.

Usage
tokenizer$from_pretrained(identifier, revision = "main", auth_token = NULL)
Arguments
identifier

The identifier of a model on the Hugging Face Hub that contains a tokenizer.json file

revision

A branch or commit id

auth_token

An optional auth token used to access private repositories on the Hugging Face Hub


Method train()

Train the tokenizer using the given files. The files are read line by line, keeping all whitespace, including new lines.

Usage
tokenizer$train(files, trainer)
Arguments
files

character vector of file paths.

trainer

an instance of a trainer object, specific to that tokenizer type.


Method train_from_memory()

Train the tokenizer on a character vector of texts

Usage
tokenizer$train_from_memory(texts, trainer)
Arguments
texts

a character vector of texts.

trainer

an instance of a trainer object, specific to that tokenizer type.
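
A hypothetical sketch of in-memory training. The constructor names model_bpe$new() and trainer_bpe$new() are assumptions, not verified against this package; check the package index for the model and trainer classes matching your tokenizer type:

```r
library(tok)

# model_bpe and trainer_bpe are assumed names for illustration only
tk <- tokenizer$new(model_bpe$new())
trainer <- trainer_bpe$new(vocab_size = 1000)

texts <- c("the quick brown fox", "jumps over the lazy dog")
tk$train_from_memory(texts, trainer)
tk$get_vocab_size()
```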


Method save()

Saves the tokenizer to a JSON file

Usage
tokenizer$save(path, pretty = TRUE)
Arguments
path

A path to a file in which to save the serialized tokenizer.

pretty

Whether the JSON file should be pretty formatted.
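
A sketch of saving and reloading a tokenizer, assuming the "gpt2" tokenizer from the Examples section:

```r
library(tok)

tk <- tokenizer$from_pretrained("gpt2")
path <- tempfile(fileext = ".json")
tk$save(path)

# The serialized tokenizer can be restored with the static from_file() method
reloaded <- tokenizer$from_file(path)
```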


Method enable_padding()

Enables padding for the tokenizer

Usage
tokenizer$enable_padding(
  direction = "right",
  pad_id = 0L,
  pad_type_id = 0L,
  pad_token = "[PAD]",
  length = NULL,
  pad_to_multiple_of = NULL
)
Arguments
direction

(str, optional, defaults to right) — The direction in which to pad. Can be either 'right' or 'left'

pad_id

(int, defaults to 0) — The id to be used when padding

pad_type_id

(int, defaults to 0) — The type id to be used when padding

pad_token

(str, defaults to '[PAD]') — The pad token to be used when padding

length

(int, optional) — If specified, the length at which to pad. If not specified we pad using the size of the longest sequence in a batch.

pad_to_multiple_of

(int, optional) — If specified, the padding length always snaps up to the next multiple of the given value. For example, if we would otherwise pad to a length of 250 but pad_to_multiple_of = 8, we pad to 256 instead.
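
A sketch of padding a batch to a common length, assuming the "gpt2" tokenizer from the Examples section (GPT-2 has no dedicated padding token, so the pad_token choice below is illustrative):

```r
library(tok)

tk <- tokenizer$from_pretrained("gpt2")
tk$enable_padding(pad_id = 0L, pad_token = "[PAD]")

# With no fixed length, sequences are padded to the longest one in the batch
encs <- tk$encode_batch(list("short", "a somewhat longer sequence"))
sapply(encs, function(e) length(e$ids))

tk$no_padding()
```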


Method no_padding()

Disables padding

Usage
tokenizer$no_padding()

Method enable_truncation()

Enables truncation on the tokenizer

Usage
tokenizer$enable_truncation(
  max_length,
  stride = 0,
  strategy = "longest_first",
  direction = "right"
)
Arguments
max_length

The maximum length at which to truncate.

stride

The number of tokens from the end of the previous sequence to be included in each overflowing sequence. Default: 0.

strategy

The strategy used for truncation. Can be one of: "longest_first", "only_first", or "only_second". Default: "longest_first".

direction

The truncation direction. Default: "right".
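
A sketch, assuming the "gpt2" tokenizer from the Examples section:

```r
library(tok)

tk <- tokenizer$from_pretrained("gpt2")
tk$enable_truncation(max_length = 5)

# The encoding is cut down to at most max_length ids
length(tk$encode("a fairly long sentence that will be truncated")$ids)

tk$no_truncation()
```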


Method no_truncation()

Disables truncation

Usage
tokenizer$no_truncation()

Method get_vocab_size()

Gets the vocabulary size

Usage
tokenizer$get_vocab_size(with_added_tokens = TRUE)
Arguments
with_added_tokens

Whether to count added tokens
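
For example, comparing the vocabulary size with and without added tokens (assuming the "gpt2" tokenizer from the Examples section):

```r
library(tok)

tk <- tokenizer$from_pretrained("gpt2")
tk$get_vocab_size()                           # includes added tokens
tk$get_vocab_size(with_added_tokens = FALSE)  # base vocabulary only
```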


Method clone()

The objects of this class are cloneable with this method.

Usage
tokenizer$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
try({
tok <- tokenizer$from_pretrained("gpt2")
tok$encode("Hello world")$ids
})
})


[Package tok version 0.1.3 Index]