sentencepiece {sentencepiece}    R Documentation

Construct a Sentencepiece model

Description

Train a Sentencepiece model on one or more text files. The trained model and its vocabulary are written to disk in model_dir.

Usage

sentencepiece(
  x,
  type = c("bpe", "char", "unigram", "word"),
  vocab_size = 8000,
  coverage = 0.9999,
  model_prefix = "sentencepiece",
  model_dir = tempdir(),
  threads = 1L,
  args,
  verbose = FALSE
)

Arguments

x

a character vector of path(s) to the text files containing training data

type

one of 'bpe', 'char', 'unigram' or 'word', selecting Byte Pair Encoding, character-level encoding, Unigram encoding or pretokenised word encoding respectively. Defaults to 'bpe' (Byte Pair Encoding).

vocab_size

integer indicating the number of tokens in the final vocabulary. Defaults to 8000.

coverage

fraction of characters covered by the model. Must be in the range [0, 1]. Defaults to 0.9999, which is a good value to use.

model_prefix

character string with the name of the model. Defaults to 'sentencepiece'. Training creates 2 files in the directory specified by model_dir, both named after model_prefix: one with the extension .model containing the model and one with the extension .vocab containing the vocabulary of the model. With the default prefix these are sentencepiece.model and sentencepiece.vocab.

model_dir

directory where the model will be saved. Defaults to the temporary directory (tempdir()).

threads

integer indicating the number of threads to use when building the model. Defaults to 1.

args

character string with arguments passed on to sentencepiece::SentencePieceTrainer::Train (for expert use only)

verbose

logical indicating whether to show the progress of sentencepiece training. Defaults to FALSE.
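
The vector default shown for type in Usage is a common R idiom. The following is a minimal sketch of how such a default typically resolves to its first element, assuming the function relies on base R's match.arg (an assumption, this page does not confirm it). pick_type is a hypothetical helper written for illustration and is not part of the package.

```r
# Hypothetical helper mirroring the signature of the type argument.
# Assumption: the real function resolves type in a similar way.
pick_type <- function(type = c("bpe", "char", "unigram", "word")) {
  # match.arg() returns the first choice when the argument is not supplied
  # and raises an error for a value outside the allowed set.
  match.arg(type)
}

pick_type()            # "bpe"
pick_type("unigram")   # "unigram"
```

Under this idiom, a misspelled value such as "unigrams" would fail early with an error rather than silently falling back to the default.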

Value

an object of class sentencepiece, as documented at sentencepiece_load_model

See Also

sentencepiece_load_model

Examples

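library(sentencepiece)
library(tokenizers.bpe)  # provides the belgium_parliament example data
data(belgium_parliament, package = "tokenizers.bpe")
path   <- "traindata.txt"  # training file, written to the working directory
folder <- getwd()          # directory in which the model files are saved

# Write the training text to disk
writeLines(belgium_parliament$text, con = path)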


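# Train character-level, unigram and BPE models in turn. All three calls use
# the default model_prefix, so each one writes sentencepiece.model and
# sentencepiece.vocab to folder, replacing the files from the previous call.
model <- sentencepiece(path, type = "char",
                       model_dir = folder, verbose = TRUE)
model <- sentencepiece(path, type = "unigram", vocab_size = 20000,
                       model_dir = folder, verbose = TRUE)
model <- sentencepiece(path, type = "bpe", vocab_size = 4000,
                       model_dir = folder, verbose = TRUE)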

# Encode example sentences as subword pieces and as vocabulary ids
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")


# Reload the saved model from disk (the BPE model, trained last) and encode again
model <- sentencepiece_load_model(file.path(folder, "sentencepiece.model"))
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")
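
The example above leaves model_prefix at its default, so every call writes to the same pair of files. The sketch below shows how a custom prefix and a scratch directory might be used to keep a model separate. The prefix "parliament_bpe" and the folder name "spm_demo" are hypothetical names chosen for illustration, and the expected file names follow the model_prefix description in Arguments.

```r
library(sentencepiece)
library(tokenizers.bpe)  # only needed for the belgium_parliament example data

data(belgium_parliament, package = "tokenizers.bpe")

# Write the training text to a temporary file and train into a scratch folder.
train_file <- tempfile(fileext = ".txt")
out_dir    <- file.path(tempdir(), "spm_demo")   # hypothetical folder name
dir.create(out_dir, showWarnings = FALSE)
writeLines(belgium_parliament$text, con = train_file)

# "parliament_bpe" is a hypothetical prefix chosen for illustration.
model <- sentencepiece(train_file, type = "bpe", vocab_size = 4000,
                       model_prefix = "parliament_bpe", model_dir = out_dir)

# Per the model_prefix description, both files should now exist in out_dir.
created <- file.path(out_dir, c("parliament_bpe.model", "parliament_bpe.vocab"))
file.exists(created)

# The saved model file can be reloaded later and used for encoding.
reloaded <- sentencepiece_load_model(created[1])
sentencepiece_encode(reloaded, x = "On est d'accord sur le prix de la biere?",
                     type = "subwords")
```

Giving each model its own prefix avoids the overwriting that occurs when several models are trained into the same directory with the default name.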




[Package sentencepiece version 0.2.3 Index]