R: Tokenizer

tokenizer {tok}

R Documentation

Tokenizer

Description

A Tokenizer works as a pipeline: it takes raw text as input and outputs an encoding.

Value

A tokenizer that can be used for encoding character strings or decoding integers.

Public fields

.tokenizer: (unsafe usage) Lower level pointer to tokenizer

Active bindings

pre_tokenizer: Gets the pre-tokenizer instance
normalizer: Gets the normalizer instance
post_processor: Gets the post processor used by tokenizer
decoder: Gets and sets the decoder
padding: Gets padding configuration
truncation: Gets truncation configuration

Methods

Public methods

tokenizer$new()
tokenizer$encode()
tokenizer$decode()
tokenizer$encode_batch()
tokenizer$decode_batch()
tokenizer$from_file()
tokenizer$from_pretrained()
tokenizer$train()
tokenizer$train_from_memory()
tokenizer$save()
tokenizer$enable_padding()
tokenizer$no_padding()
tokenizer$enable_truncation()
tokenizer$no_truncation()
tokenizer$get_vocab_size()
tokenizer$clone()

Method `new()`

Initializes a tokenizer

Usage

tokenizer$new(tokenizer)

Arguments

tokenizer: Will be cloned to initialize a new tokenizer

Method `encode()`

Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.

Usage

tokenizer$encode(
  sequence,
  pair = NULL,
  is_pretokenized = FALSE,
  add_special_tokens = TRUE
)

Arguments

sequence: The main input sequence we want to encode. This sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument
pair: An optional input sequence. The expected format is the same as for sequence.
is_pretokenized: Whether the input is already pre-tokenized
add_special_tokens: Whether to add the special tokens
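
As a minimal sketch of encoding a single sequence and a sequence pair, the following reuses the pretrained "gpt2" tokenizer from the Examples section. It assumes network access to the Hugging Face Hub and that the returned encoding exposes an ids binding, as in that example.

```r
library(tok)

# Assumes the "gpt2" tokenizer can be downloaded from the Hugging Face Hub.
tok <- tokenizer$from_pretrained("gpt2")

# Raw text: a single sequence.
single <- tok$encode("Hello world")
single$ids

# A main sequence together with an optional pair sequence.
paired <- tok$encode("Hello world", pair = "How are you?")
paired$ids
```

The ids of the paired encoding cover both inputs, so it is longer than the encoding of the first sequence alone.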

Method `decode()`

Decode the given list of ids back to a string

Usage

tokenizer$decode(ids, skip_special_tokens = TRUE)

Arguments

ids: The list of ids that we want to decode
skip_special_tokens: Whether the special tokens should be removed from the decoded string
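
A short round-trip sketch, again assuming the "gpt2" tokenizer from the Examples section is reachable on the Hugging Face Hub: the ids produced by encode() are passed back to decode().

```r
library(tok)

# Assumes the "gpt2" tokenizer can be downloaded from the Hugging Face Hub.
tok <- tokenizer$from_pretrained("gpt2")

# Encode to integer ids, then decode those ids back to a single string.
ids <- tok$encode("Hello world")$ids
text <- tok$decode(ids, skip_special_tokens = TRUE)
text
```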

Method `encode_batch()`

Encodes a batch of sequences. Returns a list of encodings.

Usage

tokenizer$encode_batch(
  input,
  is_pretokenized = FALSE,
  add_special_tokens = TRUE
)

Arguments

input: A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument.
is_pretokenized: Whether the input is already pre-tokenized
add_special_tokens: Whether to add the special tokens
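
The sketch below encodes two raw-text sequences in one call. It assumes that a character vector is accepted as the batch and that the "gpt2" tokenizer from the Examples section can be downloaded; neither is guaranteed by the text above.

```r
library(tok)

# Assumes the "gpt2" tokenizer can be downloaded from the Hugging Face Hub.
tok <- tokenizer$from_pretrained("gpt2")

# Assumption: a character vector is accepted as the batch of sequences.
texts <- c("Hello world", "How are you today?")
encodings <- tok$encode_batch(texts)

# One encoding per input sequence; extract the ids of each.
batch_ids <- lapply(encodings, function(enc) enc$ids)
batch_ids
```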

Method `decode_batch()`

Decode a batch of ids back to their corresponding strings

Usage

tokenizer$decode_batch(sequences, skip_special_tokens = TRUE)

Arguments

sequences: The batch of sequences we want to decode
skip_special_tokens: Whether the special tokens should be removed from the decoded strings
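
As an illustration, the batch passed to decode_batch() can be built as a list of integer id vectors. This sketch assumes that list format is accepted and that the "gpt2" tokenizer from the Examples section is available from the Hugging Face Hub.

```r
library(tok)

# Assumes the "gpt2" tokenizer can be downloaded from the Hugging Face Hub.
tok <- tokenizer$from_pretrained("gpt2")

texts <- c("Hello world", "How are you today?")

# Build the batch: one integer vector of ids per input text.
sequences <- lapply(texts, function(x) tok$encode(x)$ids)

# Decode all sequences in one call.
decoded <- tok$decode_batch(sequences, skip_special_tokens = TRUE)
decoded
```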

Method `from_file()`

Creates a tokenizer from the path of a serialized tokenizer. This is a static method and should be called instead of $new when initializing the tokenizer.

Usage

tokenizer$from_file(path)

Arguments

path: Path to tokenizer.json file

Method `from_pretrained()`

Instantiate a new Tokenizer from an existing file on the Hugging Face Hub.

Usage

tokenizer$from_pretrained(identifier, revision = "main", auth_token = NULL)

Arguments

identifier: The identifier of a model on the Hugging Face Hub that contains a tokenizer.json file
revision: A branch or commit id
auth_token: An optional auth token used to access private repositories on the Hugging Face Hub

Method `train()`

Train the Tokenizer using the given files. Reads the files line by line, while keeping all the whitespace, even new lines.

Usage

tokenizer$train(files, trainer)

Arguments

files: character vector of file paths.
trainer: an instance of a trainer object, specific to that tokenizer type.

Method `train_from_memory()`

Train the tokenizer on a character vector of texts

Usage

tokenizer$train_from_memory(texts, trainer)

Arguments

texts: a character vector of texts.
trainer: an instance of a trainer object, specific to that tokenizer type.
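
The two training methods above could be used roughly as follows. This is a sketch under several assumptions that this page does not confirm: that tok exports model_bpe, trainer_bpe and pre_tokenizer_whitespace, that tokenizer$new() accepts a model object, that the pre_tokenizer binding can be assigned, and that trainer_bpe takes a vocab_size argument.

```r
library(tok)

texts <- c("hello world", "hello there", "goodbye world")

# Assumption: model_bpe, trainer_bpe and pre_tokenizer_whitespace are
# exported by tok and tokenizer$new() accepts a model object.
tok <- tokenizer$new(model_bpe$new())
tok$pre_tokenizer <- pre_tokenizer_whitespace$new()

# Train directly on an in-memory character vector.
tok$train_from_memory(texts, trainer_bpe$new(vocab_size = 100))
mem_ids <- tok$encode("hello world")$ids

# Train a second tokenizer on the same text stored in a file.
corpus <- tempfile(fileext = ".txt")
writeLines(texts, corpus)
tok_file <- tokenizer$new(model_bpe$new())
tok_file$pre_tokenizer <- pre_tokenizer_whitespace$new()
tok_file$train(corpus, trainer_bpe$new(vocab_size = 100))
file_ids <- tok_file$encode("hello world")$ids
```

The trainer must match the model type of the tokenizer, which is why a BPE trainer is paired with a BPE model here.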

Method `save()`

Saves the tokenizer to a JSON file

Usage

tokenizer$save(path, pretty = TRUE)

Arguments

path: A path to a file in which to save the serialized tokenizer.
pretty: Whether the JSON file should be pretty formatted.
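
A save and reload round trip might look like the sketch below, which pairs save() with the static from_file() method. It assumes the "gpt2" tokenizer from the Examples section can be downloaded; the temporary file path is only illustrative.

```r
library(tok)

# Assumes the "gpt2" tokenizer can be downloaded from the Hugging Face Hub.
tok <- tokenizer$from_pretrained("gpt2")

# Serialize the tokenizer to a JSON file in a temporary location.
path <- tempfile(fileext = ".json")
tok$save(path, pretty = TRUE)

# Restore it with the static constructor instead of $new.
restored <- tokenizer$from_file(path)

# Both tokenizers should encode the same text to the same ids.
same_ids <- identical(
  tok$encode("Hello world")$ids,
  restored$encode("Hello world")$ids
)
same_ids
```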

Method `enable_padding()`

Enables padding for the tokenizer

Usage

tokenizer$enable_padding(
  direction = "right",
  pad_id = 0L,
  pad_type_id = 0L,
  pad_token = "[PAD]",
  length = NULL,
  pad_to_multiple_of = NULL
)

Arguments

direction: The direction in which to pad. Can be either "right" or "left". Default: "right".
pad_id: The integer id to use when padding. Default: 0.
pad_type_id: The integer type id to use when padding. Default: 0.
pad_token: The pad token to use when padding. Default: "[PAD]".
length: Optional integer. If specified, the length to pad to. If not specified, sequences are padded to the length of the longest sequence in the batch.
pad_to_multiple_of: Optional integer. If specified, the padding length always snaps to the next multiple of the given value. For example, a padding length of 250 with pad_to_multiple_of = 8 becomes 256.
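
The following sketch pads a short input to a fixed length and reproduces the pad_to_multiple_of arithmetic described above. It assumes the "gpt2" tokenizer from the Examples section is available and that padding is applied to a single encode() call when length is set. snap_to_multiple() is a hypothetical helper written for this illustration, not part of tok.

```r
library(tok)

# Assumes the "gpt2" tokenizer can be downloaded from the Hugging Face Hub.
tok <- tokenizer$from_pretrained("gpt2")

# Pad every encoding to exactly 8 tokens, using the default pad id and token.
tok$enable_padding(direction = "right", length = 8L)
padded <- tok$encode("Hello world")
length(padded$ids)

# Hypothetical helper: round a length up to the next multiple of a value,
# mirroring what pad_to_multiple_of is documented to do.
snap_to_multiple <- function(n, multiple) ceiling(n / multiple) * multiple
snap_to_multiple(250, 8)  # 256

# Restore the unpadded behaviour.
tok$no_padding()
```

The default pad_id and pad_token are placeholders; for a real model they would normally be set to the pad token of that model's vocabulary.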

Method `no_padding()`

Disables padding

Usage

tokenizer$no_padding()

Method `enable_truncation()`

Enables truncation on the tokenizer

Usage

tokenizer$enable_truncation(
  max_length,
  stride = 0,
  strategy = "longest_first",
  direction = "right"
)

Arguments

max_length: The maximum length at which to truncate.
stride: The length of the previous first sequence to be included in the overflowing sequence. Default: 0.
strategy: The strategy used for truncation. Can be one of: "longest_first", "only_first", or "only_second". Default: "longest_first".
direction: The truncation direction. Default: "right".
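
To illustrate the effect of max_length, this sketch encodes the same long text before and after enabling truncation. It assumes the "gpt2" tokenizer from the Examples section can be downloaded from the Hugging Face Hub.

```r
library(tok)

# Assumes the "gpt2" tokenizer can be downloaded from the Hugging Face Hub.
tok <- tokenizer$from_pretrained("gpt2")

long_text <- paste(rep("hello world", 20), collapse = " ")

# Without truncation, the encoding covers the whole text.
full_ids <- tok$encode(long_text)$ids

# Keep at most 5 tokens, cutting from the right.
tok$enable_truncation(max_length = 5, direction = "right")
short_ids <- tok$encode(long_text)$ids

c(full = length(full_ids), truncated = length(short_ids))

# Switch truncation off again.
tok$no_truncation()
```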

Method `no_truncation()`

Disables truncation

Usage

tokenizer$no_truncation()

Method `get_vocab_size()`

Gets the vocabulary size

Usage

tokenizer$get_vocab_size(with_added_tokens = TRUE)

Arguments

with_added_tokens: Whether to count added tokens
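
A brief sketch comparing the two settings of with_added_tokens, assuming the "gpt2" tokenizer from the Examples section is reachable on the Hugging Face Hub:

```r
library(tok)

# Assumes the "gpt2" tokenizer can be downloaded from the Hugging Face Hub.
tok <- tokenizer$from_pretrained("gpt2")

# Vocabulary size including added tokens (the default).
with_added <- tok$get_vocab_size()

# Vocabulary size of the underlying model only.
base_only <- tok$get_vocab_size(with_added_tokens = FALSE)

c(with_added = with_added, base_only = base_only)
```

The count with added tokens can never be smaller than the count without them.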

Method `clone()`

The objects of this class are cloneable with this method.

Usage

tokenizer$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples

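withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
  # Cache the download in a temporary directory; try() keeps the
  # example from failing when the Hub cannot be reached.
  try({
    tok <- tokenizer$from_pretrained("gpt2")
    # Integer token ids for the input text
    tok$encode("Hello world")$ids
  })
})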

[Package tok version 0.1.3 Index]
