sentencepiece_download_model {sentencepiece}    R Documentation

Download a Sentencepiece model

Description

Download pretrained models built on Wikipedia and made available at https://bpemb.h-its.org through https://github.com/bheinzerling/bpemb. Each download contains a Byte Pair Encoding model trained with sentencepiece as well as GloVe embeddings of these Byte Pair subwords. Models for 275 languages are available.

Usage

sentencepiece_download_model(
  language,
  vocab_size,
  dim,
  model_dir = system.file(package = "sentencepiece", "models")
)

Arguments

language

a character string with the language name. This can be either a plain language name or a Wikipedia shorthand.
Possible values can be found by looking at the examples or by typing sentencepiece:::.bpemb$languages.
If you provide 'multi', the multilingual model available at https://bpemb.h-its.org/multi/ is downloaded.
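
For instance, to peek at the first few available values (a sketch; .bpemb is an internal object, so this relies on the ::: access mentioned above):

head(sentencepiece:::.bpemb$languages)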

vocab_size

integer indicating the number of tokens in the final vocabulary. Defaults to 5000. Possible values depend on the language. To inspect possible values, type sentencepiece:::.bpemb$vocab_sizes and look up the language of your choice.
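
For instance, to inspect the vocabulary sizes on offer for Dutch (a sketch; indexing vocab_sizes by the Wikipedia shorthand "nl" is an assumption based on the examples below):

sentencepiece:::.bpemb$vocab_sizes[["nl"]]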

dim

dimension of the embedding. Either 25, 50, 100, 200 or 300. If not provided, only the sentencepiece tokeniser model is downloaded, without the GloVe embeddings (see the examples).

model_dir

path to the directory where the model will be downloaded to. Defaults to system.file(package = "sentencepiece", "models").

Value

a list with elements, among which file_model, the path to the downloaded sentencepiece model, and, if dim was provided, glove, a list which in turn contains file_model, the path to the downloaded GloVe embedding file. Inspect the full structure with str(dl) as shown in the examples.

See Also

sentencepiece_load_model

Examples

path <- getwd()

##
## Download only the tokeniser model
##
dl <- sentencepiece_download_model("Russian", vocab_size = 50000, model_dir = path)
dl <- sentencepiece_download_model("English", vocab_size = 100000, model_dir = path)
dl <- sentencepiece_download_model("French", vocab_size = 25000, model_dir = path)
dl <- sentencepiece_download_model("multi", vocab_size = 320000, model_dir = path)
dl <- sentencepiece_download_model("Vlaams", vocab_size = 1000, model_dir = path)
dl <- sentencepiece_download_model("Dutch", vocab_size = 25000, model_dir = path)
dl <- sentencepiece_download_model("nl", vocab_size = 25000, model_dir = path)
str(dl)
model     <- sentencepiece_load_model(dl$file_model)
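
##
## Tokenise text with the loaded model: a sketch using sentencepiece_encode
## from this package; the Dutch example sentence is made up for illustration
##
txt <- "De kat zit op de mat."
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")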

##
## Download the tokeniser model + Glove embeddings of Byte Pairs
##
dl <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 50, model_dir = path)
str(dl)
model     <- sentencepiece_load_model(dl$file_model)
embedding <- read_word2vec(dl$glove$file_model)
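
##
## Combine both: look up the GloVe vectors of the subwords of a sentence.
## A sketch which assumes read_word2vec returns a matrix with the subwords
## as rownames and sentencepiece_encode returns a list of character vectors
##
subwords <- sentencepiece_encode(model, x = "De kat zit op de mat.", type = "subwords")[[1]]
embedding[intersect(subwords, rownames(embedding)), ]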

dl <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 25,
                                   model_dir = tempdir())
str(dl)