load_vocab {morphemepiece} | R Documentation
Load a vocabulary file
Description
Usually you will want to use the included vocabulary, which can be accessed via morphemepiece_vocab(). This function can be used to load a different vocabulary from a file.
Usage
load_vocab(vocab_file)
Arguments
vocab_file | Path to the vocabulary file. The file is assumed to be a text file with one token per line, with the line number (starting at zero) corresponding to the index of that token in the vocabulary.
Value
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
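Examples
The zero-based indexing described above can be illustrated with a small base-R sketch. This is not the package's implementation: it only reads a one-token-per-line file and derives the indices from line positions. The file contents, and the names vocab_path, tokens, and token_index, are made up for illustration.

```r
# Write a tiny, made-up vocabulary to a temporary file (one token per line).
vocab_path <- tempfile(fileext = ".txt")
writeLines(c("[PAD]", "[UNK]", "the", "cat"), vocab_path)

# Read the tokens; a token's position in the file determines its index.
tokens <- readLines(vocab_path)

# R vectors are one-based, so the zero-based vocabulary index is position - 1.
token_index <- setNames(seq_along(tokens) - 1L, tokens)

token_index[["[PAD]"]]  # 0: the first line of the file
token_index[["cat"]]    # 3: the fourth line of the file

# Reverse lookup: add 1 to a zero-based index to get the R position.
tokens[3L + 1L]         # "cat"
```

With the package installed, calling load_vocab(vocab_path) on the same file should return these tokens as a character vector, with the inferred casedness attached as the "is_cased" attribute.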