R: Tokenize Sequence with Morpheme Pieces

morphemepiece_tokenize {morphemepiece}

R Documentation

Tokenize Sequence with Morpheme Pieces

Tokenizes a single sequence of text using a morphemepiece vocabulary and lookup table.

morphemepiece_tokenize(
  text,
  vocab = morphemepiece_vocab(),
  lookup = morphemepiece_lookup(),
  unk_token = "[UNK]",
  max_chars = 100
)

`text`	Character scalar; text to tokenize.
`vocab`	A morphemepiece vocabulary.
`lookup`	A morphemepiece lookup table.
`unk_token`	Token to represent unknown words.
`max_chars`	Maximum length, in characters, of a word that will be recognized.

A character vector of tokenized text. (Later, this should be a named integer vector, as in the wordpiece package.)
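
A minimal sketch of a call, using only the arguments documented above. It assumes the morphemepiece package is installed and that the default vocabulary and lookup table can be loaded in the current session (this may require additional data to be installed). The input sentence is illustrative, and no particular output is claimed here.

```r
# Sketch only: the call is guarded so the script still runs when the
# morphemepiece package (or its default vocabulary data) is unavailable.
tokens <- NULL

if (requireNamespace("morphemepiece", quietly = TRUE)) {
  tokens <- tryCatch(
    morphemepiece::morphemepiece_tokenize(
      text = "Surprisingly easy tokenization",  # a single character scalar
      unk_token = "[UNK]",                      # documented default
      max_chars = 100                           # documented default
    ),
    error = function(e) {
      # Loading the default vocab/lookup may fail if the data is missing.
      message("Tokenization failed: ", conditionMessage(e))
      NULL
    }
  )
}

# NULL if the package or its data was unavailable; otherwise the tokens.
print(tokens)
```

The vocab and lookup arguments are left at their defaults here; a custom vocabulary and lookup table could be passed in their place.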

[Package morphemepiece version 1.2.3 Index]