sentencepiece_download_model {sentencepiece}    R Documentation
Download a Sentencepiece model
Description
Download pretrained models built on Wikipedia, made available at https://bpemb.h-its.org through https://github.com/bheinzerling/bpemb. Each download contains a Byte Pair Encoding model trained with sentencepiece as well as GloVe embeddings of these Byte Pair subwords. Models are available for 275 languages.
Usage
sentencepiece_download_model(
  language,
  vocab_size,
  dim,
  model_dir = system.file(package = "sentencepiece", "models")
)
Arguments
language
a character string with the language name. This can be either a plain language name (e.g. "Dutch") or a Wikipedia shorthand (e.g. "nl").
vocab_size
integer indicating the number of tokens in the final vocabulary. Defaults to 5000. Possible values depend on the language. To inspect the possible values, type sentencepiece:::.bpemb$vocab_sizes and look up the language of your choice (see the sketch after this argument list).
dim
dimension of the embedding. Either 25, 50, 100, 200 or 300.
model_dir
path to the location where the model will be downloaded to. Defaults to system.file(package = "sentencepiece", "models").
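A minimal sketch of inspecting the available vocabulary sizes before downloading; it relies only on the internal sentencepiece:::.bpemb object mentioned under vocab_size, whose exact structure may vary between package versions:

## Show the vocabulary sizes known per language
str(sentencepiece:::.bpemb$vocab_sizes)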
Value
a list with elements
language: the provided language
wikicode: the wikipedia code of the provided language
file_model: the path to the downloaded Sentencepiece model
url: the url where the Sentencepiece model was fetched from
download_failed: logical, indicating if the download failed
download_message: a character string with possible download failure information
glove: a list with elements file_model, url, download_failed and download_message, indicating the path to the GloVe embeddings in txt format. Only present if the dim argument is provided; otherwise the embeddings are not downloaded.
glove.bin: a list with elements file_model, url, download_failed and download_message, indicating the path to the GloVe embeddings in bin format. Only present if the dim argument is provided; otherwise the embeddings are not downloaded.
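A minimal sketch of defensive usage based on the status fields above (the field names are taken from this Value section; the language, vocabulary size and download location are illustrative):

dl <- sentencepiece_download_model("nl", vocab_size = 1000, model_dir = tempdir())
if (isTRUE(dl$download_failed)) {
  message("Download failed: ", dl$download_message)
} else {
  model <- sentencepiece_load_model(dl$file_model)
}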
See Also
sentencepiece_load_model
Examples
path <- getwd()
##
## Download only the tokeniser model
##
dl <- sentencepiece_download_model("Russian", vocab_size = 50000, model_dir = path)
dl <- sentencepiece_download_model("English", vocab_size = 100000, model_dir = path)
dl <- sentencepiece_download_model("French", vocab_size = 25000, model_dir = path)
dl <- sentencepiece_download_model("multi", vocab_size = 320000, model_dir = path)
dl <- sentencepiece_download_model("Vlaams", vocab_size = 1000, model_dir = path)
dl <- sentencepiece_download_model("Dutch", vocab_size = 25000, model_dir = path)
dl <- sentencepiece_download_model("nl", vocab_size = 25000, model_dir = path)
str(dl)
model <- sentencepiece_load_model(dl$file_model)
##
## Download the tokeniser model + Glove embeddings of Byte Pairs
##
dl <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 50, model_dir = path)
str(dl)
model <- sentencepiece_load_model(dl$file_model)
embedding <- read_word2vec(dl$glove$file_model)
dl <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 25,
model_dir = tempdir())
str(dl)
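##
## Tokenise text with a downloaded model (a sketch: it assumes the
## sentencepiece_encode() function from this package; the Dutch sample
## sentence is purely illustrative)
##
model <- sentencepiece_load_model(dl$file_model)
txt <- "De kat zit op de mat"
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")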