wordpiece_encode {sentencepiece} | R Documentation |
Wordpiece encoding
Description
Wordpiece encoding, useful for BERT-style tokenisation. This is an experimental version mimicking the class WordpieceTokenizer from https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py
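For intuition, the WordPiece scheme greedily matches the longest vocabulary entry at each position of a word, prefixing non-initial pieces with "##". A minimal base-R sketch of this greedy longest-match-first loop for a single word (a simplified illustration under the argument names above, not the package's implementation):

```r
# Greedy longest-match-first WordPiece encoding of one word.
# Hypothetical helper for illustration only.
wordpiece_one <- function(word, vocabulary, unk_token = "[UNK]",
                          max_input_chars_per_word = 100L) {
  if (nchar(word) > max_input_chars_per_word) return(unk_token)
  tokens <- character()
  start  <- 1L
  n      <- nchar(word)
  while (start <= n) {
    end   <- n
    match <- NA_character_
    # shrink the candidate substring from the right until it is in the vocabulary
    while (start <= end) {
      piece <- substr(word, start, end)
      if (start > 1L) piece <- paste0("##", piece)  # continuation pieces get "##"
      if (piece %in% vocabulary) { match <- piece; break }
      end <- end - 1L
    }
    if (is.na(match)) return(unk_token)  # no subword matches: whole word is unknown
    tokens <- c(tokens, match)
    start  <- end + 1L
  }
  tokens
}

wordpiece_one("unaffable", c("un", "##aff", "##able"))
# returns c("un", "##aff", "##able")
```

Mapping subwords to ids (as `type = "ids"` does) then reduces to looking up each returned piece's position in the vocabulary.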
Usage
wordpiece_encode(
x,
vocabulary = character(),
type = c("subwords", "ids"),
unk_token = "[UNK]",
max_input_chars_per_word = 100L
)
Arguments
x |
a character vector with text which can be split on white space to obtain words |
vocabulary |
a character vector of the vocabulary |
type |
a character string, either 'subwords' or 'ids', indicating whether to return the subwords themselves or their corresponding ids as defined in the vocabulary of the model. Defaults to 'subwords'. |
unk_token |
a character string used for words which cannot be encoded with the vocabulary. Defaults to '[UNK]' |
max_input_chars_per_word |
integer. A word longer than this number of characters will be mapped to the unknown token. Defaults to 100. |
Value
a list with one element per element of x, containing the subword tokens, or the corresponding vocabulary ids when type is 'ids'
Examples
wordpiece_encode("unaffable", vocabulary = c("un", "##aff", "##able"))
wordpiece_encode(x = c("unaffable", "unaffableun"),
vocabulary = c("un", "##aff", "##able"))
wordpiece_encode(x = c("unaffable", "unaffableun", "unknown territory"),
vocabulary = c("un", "##aff", "##able", "##un"))
wordpiece_encode(x = c("unaffable", "unaffableun", "unknown territory"),
vocabulary = c("un", "##aff", "##able", "##un"),
type = "ids")