prepare_vocab {morphemepiece}    R Documentation
Format a Token List as a Vocabulary
Description
We use a character vector with class morphemepiece_vocabulary to provide
information about tokens used in morphemepiece_tokenize. This function takes
a character vector of tokens and puts it into that format.
Usage
prepare_vocab(token_list)
Arguments
token_list
A character vector of tokens.
Value
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
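The zero-based convention described above can be illustrated in base R alone. This is a minimal sketch, not the package's implementation: the names tokens, token_index, reordered, and new_index are hypothetical and introduced here only for illustration, and the code does not call morphemepiece at all.

```r
# Illustrative only: base R, no morphemepiece functions involved.
tokens <- c("some", "example", "tokens")

# Zero-based index of each token: its position in the vector minus one.
token_index <- setNames(seq_along(tokens) - 1L, tokens)

# Look up the index a model would see for a given token.
token_index[["example"]]

# Map an index back to its token (add one to return to R's 1-based positions).
tokens[token_index[["example"]] + 1L]

# Reordering the tokens changes every index, which is why the order
# must stay fixed once a model has been trained on the vocabulary.
reordered <- rev(tokens)
new_index <- setNames(seq_along(reordered) - 1L, reordered)
token_index[["some"]] == new_index[["some"]]
```

Because R vectors are 1-based, the subtraction and the matching addition are the only places where the zero-based convention has to be handled explicitly.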
Examples
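library(morphemepiece)
# Convert a plain character vector of tokens into a vocabulary object.
my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)             # class vector of the vocabulary object
attr(my_vocab, "is_cased")  # casedness inferred from the tokens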
[Package morphemepiece version 1.2.3 Index]