bpe_encode {tokenizers.bpe} | R Documentation
Tokenise text using a Byte Pair Encoding model
Description
Tokenise text using a Byte Pair Encoding model, returning either the subwords or their ids as defined in the model vocabulary.
Usage
bpe_encode(
  model,
  x,
  type = c("subwords", "ids"),
  bos = FALSE,
  eos = FALSE,
  reverse = FALSE
)
Arguments
model | a Byte Pair Encoding model object, such as the one returned by bpe() in the Examples
x | a character vector of text to tokenise
type | a character string, either 'subwords' or 'ids', selecting whether to return the subwords or the ids of those subwords as defined in the vocabulary of the model. Defaults to 'subwords'.
bos | logical; if TRUE, a 'beginning of sentence' token is added. Defaults to FALSE.
eos | logical; if TRUE, an 'end of sentence' token is added. Defaults to FALSE.
reverse | logical; if TRUE, the output sequence of tokens is reversed. Defaults to FALSE.
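The effect of bos, eos and reverse is easiest to see by comparing token counts. The following is a minimal sketch rather than part of the package documentation. It assumes that bpe_encode() returns a list with one element per input text, and it trains a model in the same way as the Examples section. The variable names plain, marked and flipped are illustrative only.

```r
library(tokenizers.bpe)

# Train a small model on the French texts shipped with the package,
# using the same settings as the Examples section.
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
model <- bpe(x$text, coverage = 0.999, vocab_size = 5000, threads = 1)

text <- "Proportion de femmes dans les situations de famille monoparentale."

# Encode the same sentence three ways (assumption: a list is returned,
# one element per input text).
plain   <- bpe_encode(model, x = text, type = "ids")
marked  <- bpe_encode(model, x = text, type = "ids", bos = TRUE, eos = TRUE)
flipped <- bpe_encode(model, x = text, type = "ids", reverse = TRUE)

# bos and eos each add one token, so 'marked' should be two tokens longer
# than 'plain'; 'flipped' should have the same number of tokens as 'plain'.
lengths(plain)
lengths(marked)
lengths(flipped)

# Remove the model file written by bpe()
file.remove(model$model_path)
```

Using type = "ids" keeps the comparison simple, because the added markers appear as extra integer ids rather than as strings.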
Examples
library(tokenizers.bpe)
## Train a BPE model on the French-language texts in belgium_parliament
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
model <- bpe(x$text, coverage = 0.999, vocab_size = 5000, threads = 1)
model
str(model$vocabulary)
## Encode two sentences as subwords and as vocabulary ids
text <- c("L'appartement est grand & vraiment bien situe en plein centre",
          "Proportion de femmes dans les situations de famille monoparentale.")
bpe_encode(model, x = text, type = "subwords")
bpe_encode(model, x = text, type = "ids")
## Decode the ids back to text
encoded <- bpe_encode(model, x = text, type = "ids")
decoded <- bpe_decode(model, encoded)
decoded
## Remove the model file (Clean up for CRAN)
file.remove(model$model_path)
[Package tokenizers.bpe version 0.1.3 Index]