morphemepiece_tokenize {morphemepiece}    R Documentation
Tokenize Sequence with Morpheme Pieces
Description
Given a single sequence of text, a morphemepiece vocabulary, and a lookup table, tokenizes the text into morpheme pieces.
Usage
morphemepiece_tokenize(
  text,
  vocab = morphemepiece_vocab(),
  lookup = morphemepiece_lookup(),
  unk_token = "[UNK]",
  max_chars = 100
)
Arguments
text: Character scalar; the text to tokenize.
vocab: A morphemepiece vocabulary.
lookup: A morphemepiece lookup table.
unk_token: Token used to represent unknown words.
max_chars: Maximum length (in characters) of a word recognized by the tokenizer.
Value
A character vector of tokenized text. (Later, this should become a named integer vector, as in the wordpiece package.)
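Examples
The sketch below illustrates a typical call using the default vocabulary and lookup from the Usage section above; the input string and the pieces shown in the comments are illustrative only, and loading the defaults may require the vocabulary data distributed alongside the package.
library(morphemepiece)

# Tokenize a single short string with the default vocabulary and lookup.
tokens <- morphemepiece_tokenize(
  text = "Surprisingly easy!",
  vocab = morphemepiece_vocab(),
  lookup = morphemepiece_lookup()
)

# `tokens` is a character vector of morpheme pieces; the exact pieces
# depend on the vocabulary, e.g. something along the lines of
# "surprise" "##ing" "##ly" "easy" (illustrative, not a guaranteed output).
tokens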
[Package morphemepiece version 1.2.3]