morphemepiece_tokenize {morphemepiece}    R Documentation

Tokenize Sequence with Morpheme Pieces

Description

Tokenizes a single sequence of text using a morphemepiece vocabulary and lookup table.

Usage

morphemepiece_tokenize(
  text,
  vocab = morphemepiece_vocab(),
  lookup = morphemepiece_lookup(),
  unk_token = "[UNK]",
  max_chars = 100
)

Arguments

text

Character scalar; text to tokenize.

vocab

A morphemepiece vocabulary.

lookup

A morphemepiece lookup table.

unk_token

Token to represent unknown words.

max_chars

Maximum length, in characters, of a word that will be recognized.

Value

A character vector of tokenized text. (Later, this should be a named integer vector, as in the wordpiece package.)
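
Examples

The following is a minimal sketch of a call, not an example shipped with the package. It assumes that morphemepiece is installed and that the default vocab and lookup arguments resolve on your system. The input string and the variable names input_text and tokens are hypothetical. The unk_token and max_chars values are the defaults listed under Usage.

```r
# Hypothetical input; the "text" argument expects a character scalar.
input_text <- "Surprisingly easy tokenization"

tokens <- NULL
if (requireNamespace("morphemepiece", quietly = TRUE)) {
  # Spell out the documented defaults for unk_token and max_chars;
  # vocab and lookup are left at their defaults from the Usage section.
  tokens <- morphemepiece::morphemepiece_tokenize(
    text = input_text,
    unk_token = "[UNK]",
    max_chars = 100
  )
  # Inspect the structure rather than assuming a particular return type.
  str(tokens)
} else {
  message("Package 'morphemepiece' is not installed; skipping the example.")
}
```

Inspecting the result with str() is a deliberate choice here, since the Value section notes that the return type may change in a later version.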


[Package morphemepiece version 1.2.3 Index]