wordpiece_tokenize {wordpiece}    R Documentation
Tokenize Sequence with Word Pieces
Description
Given a sequence of text and a wordpiece vocabulary, tokenizes the text.
Usage
wordpiece_tokenize(
  text,
  vocab = wordpiece_vocab(),
  unk_token = "[UNK]",
  max_chars = 100
)
Arguments
text
    Character; the text to tokenize.

vocab
    Character vector of vocabulary tokens. The tokens are assumed to be in
    order of index, with the first index taken as zero, for compatibility
    with Python implementations.

unk_token
    Token used to represent unknown words.

max_chars
    Integer; the maximum length, in characters, of a word to be recognized.
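For illustration, a small custom vocabulary can be supplied in place of the default. The toy vocabulary below is hypothetical and not shipped with the package; per the vocab description above, token ids in the result follow the vocabulary order, starting at zero:

# A hypothetical five-token vocabulary; ids run 0 through 4 in this order.
toy_vocab <- c("[UNK]", "i", "love", "ta", "##cos")
wordpiece_tokenize(
  text = "i love tacos",
  vocab = toy_vocab
)
# Words that cannot be built from vocabulary entries map to unk_token.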
Value
A list of named integer vectors, giving the tokenization of the input sequences. The integer values are the token ids, and the names are the tokens.
Examples
# Tokenize two example sentences with the default vocabulary.
tokens <- wordpiece_tokenize(
  text = c(
    "I love tacos!",
    "I also kinda like apples."
  )
)
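As described under Value, each element of the result is a named integer vector, so the tokenization of the first sentence can be inspected with ordinary indexing:

tokens[[1]]        # integer token ids for "I love tacos!"
names(tokens[[1]]) # the corresponding wordpiece tokens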
[Package wordpiece version 2.1.3]