sentencepiece {sentencepiece} | R Documentation |
Construct a Sentencepiece model
Description
Construct a Sentencepiece model by training it on one or more text files.
Usage
sentencepiece(
x,
type = c("bpe", "char", "unigram", "word"),
vocab_size = 8000,
coverage = 0.9999,
model_prefix = "sentencepiece",
model_dir = tempdir(),
threads = 1L,
args,
verbose = FALSE
)
Arguments
x |
a character vector of path(s) to the text files containing training data |
type |
one of 'bpe', 'char', 'unigram' or 'word', for Byte Pair Encoding, character-level encoding, Unigram encoding or pretokenised word encoding respectively. Defaults to 'bpe' (Byte Pair Encoding). |
vocab_size |
integer indicating the number of tokens in the final vocabulary. Defaults to 8000. |
coverage |
fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999. |
model_prefix |
character string with the name of the model. Defaults to 'sentencepiece'.
When executing the function, 2 files will be created in the directory specified by model_dir. |
model_dir |
directory where the model will be saved. Defaults to the temporary directory (tempdir()) |
threads |
integer indicating number of threads to use when building the model |
args |
character string with arguments passed on to sentencepiece::SentencePieceTrainer::Train (for expert use only) |
verbose |
logical indicating whether to show progress of sentencepiece training. Defaults to FALSE. |
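The way model_prefix and model_dir combine into output paths can be sketched in base R. This is a minimal illustration, not the package's own code: the '.model' file name matches the one loaded in the Examples, while the '.vocab' extension is an assumption based on what the SentencePiece trainer normally writes. The variable expected_files is a hypothetical name used only here.

```r
# Default values taken from the Usage section
model_prefix <- "sentencepiece"
model_dir <- tempdir()

# Assumed output files: <model_prefix>.model and <model_prefix>.vocab
# ('.vocab' is an assumption, see the note above)
expected_files <- file.path(model_dir,
                            paste0(model_prefix, c(".model", ".vocab")))

# File names only, without the directory part
basename(expected_files)

# TRUE for each file that already exists, e.g. after a training run
file.exists(expected_files)
```

Choosing a distinct model_prefix per run keeps models of different types from sharing the same file name in model_dir.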
Value
an object of class sentencepiece
which is defined at sentencepiece_load_model
See Also
sentencepiece_load_model
Examples
library(sentencepiece)
library(tokenizers.bpe)
# Example corpus shipped with the tokenizers.bpe package
data(belgium_parliament, package = "tokenizers.bpe")
path <- "traindata.txt"
folder <- getwd()
writeLines(belgium_parliament$text, con = path)
# Train one model per type; all calls use the default model_prefix,
# so the last model trained (bpe) is the one left in sentencepiece.model
model <- sentencepiece(path, type = "char",
                       model_dir = folder, verbose = TRUE)
model <- sentencepiece(path, type = "unigram", vocab_size = 20000,
                       model_dir = folder, verbose = TRUE)
model <- sentencepiece(path, type = "bpe", vocab_size = 4000,
                       model_dir = folder, verbose = TRUE)
# Encode text as subword pieces and as integer ids
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")
# Reload the saved model from disk and encode again
model <- sentencepiece_load_model(file.path(folder, "sentencepiece.model"))
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")
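The encoded ids can also be turned back into text. The sketch below assumes that sentencepiece_decode() from the same package accepts the list of integer ids returned by sentencepiece_encode(); it trains into the temporary directory so it does not touch the files created above. The decoded text is not guaranteed to match the input character for character, since the model normalises text during encoding.

```r
library(sentencepiece)
library(tokenizers.bpe)

# Write the example corpus to a temporary training file
data(belgium_parliament, package = "tokenizers.bpe")
path <- tempfile(fileext = ".txt")
writeLines(belgium_parliament$text, con = path)

# Train a BPE model with the same settings as in the examples above
model <- sentencepiece(path, type = "bpe", vocab_size = 4000,
                       model_dir = tempdir())

txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")

# Encode to integer ids, then decode the ids back to text
ids <- sentencepiece_encode(model, x = txt, type = "ids")
decoded <- sentencepiece_decode(model, ids)
decoded
```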