prepare_vocab {wordpiece} | R Documentation
Format a Token List as a Vocabulary
Description
We use a special named integer vector with class wordpiece_vocabulary to
provide information about tokens used in wordpiece_tokenize.
This function takes a character vector of tokens and puts it into that
format.
Usage
prepare_vocab(token_list)
Arguments
token_list
A character vector of tokens.
Value
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.
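The zero-based indexing described above can be sketched in base R. This is only an illustration of the convention, not the package's internal implementation; the names tokens, token_index, and reordered_index are hypothetical, and no wordpiece functions are called.

```r
# Hypothetical token list (base R only; not the package's internal code).
tokens <- c("some", "example", "tokens")

# The index of each token is its position in the vector, starting at zero.
token_index <- setNames(seq_along(tokens) - 1L, tokens)

# Map a sequence of tokens to the indices a model would receive.
ids <- unname(token_index[c("tokens", "some")])

# Reordering the token list silently changes every index, which is why
# the order must stay fixed for a pre-trained model.
reordered <- rev(tokens)
reordered_index <- setNames(seq_along(reordered) - 1L, reordered)
ids_after <- unname(reordered_index[c("tokens", "some")])
```

Because a model only ever sees the integers, the same two tokens map to different inputs once the order changes, even though the set of tokens is identical.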
Examples
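# Build a vocabulary from a character vector of tokens.
my_vocab <- prepare_vocab(c("some", "example", "tokens"))
# Inspect the class of the result.
class(my_vocab)
# Inspect the inferred casedness.
attr(my_vocab, "is_cased")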
[Package wordpiece version 2.1.3 Index]