wordpiece_tokenize {wordpiece}    R Documentation
Tokenize Sequence with Word Pieces
Description
Given a sequence of text and a wordpiece vocabulary, tokenizes the text.
Usage
wordpiece_tokenize(
  text,
  vocab = wordpiece_vocab(),
  unk_token = "[UNK]",
  max_chars = 100
)
Arguments
text
    Character; the text to tokenize.

vocab
    Character vector of vocabulary tokens. The tokens are assumed to be in
    order of index, with the first index taken as zero, for compatibility
    with Python implementations.

unk_token
    Token used to represent unknown words.

max_chars
    Integer; the maximum length, in characters, of a word to be recognized.
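For illustration, a small custom vocabulary can be supplied in place of the default. The toy vocabulary below is hypothetical and not shipped with the package; per the vocab description above, token ids in the result follow the vocabulary order, starting at zero:

# A hypothetical five-token vocabulary; ids run 0 through 4 in this order.
toy_vocab <- c("[UNK]", "i", "love", "ta", "##cos")
wordpiece_tokenize(
  text = "i love tacos",
  vocab = toy_vocab
)
# Words that cannot be built from vocabulary entries map to unk_token.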
Value
A list of named integer vectors, giving the tokenization of the input sequences. The integer values are the token ids, and the names are the tokens.
Examples
# Tokenize two example sentences with the default vocabulary.
tokens <- wordpiece_tokenize(
  text = c(
    "I love tacos!",
    "I also kinda like apples."
  )
)
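As described under Value, each element of the result is a named integer vector, so the tokenization of the first sentence can be inspected with ordinary indexing:

tokens[[1]]        # integer token ids for "I love tacos!"
names(tokens[[1]]) # the corresponding wordpiece tokens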
[Package wordpiece version 2.1.3]