BPEembedder {sentencepiece}    R Documentation
Build a BPEembed model containing a Sentencepiece and Word2vec model
Description
Build a sentencepiece model on text and a matching word2vec model on the resulting sentencepiece vocabulary.
Usage
BPEembedder(
x,
tokenizer = c("bpe", "char", "unigram", "word"),
args = list(vocab_size = 8000, coverage = 0.9999),
...
)
Arguments
x: a data.frame with columns doc_id and text

tokenizer: character string with the type of sentencepiece tokenizer: either 'bpe', 'char', 'unigram' or 'word' for Byte Pair Encoding, character-level encoding, unigram encoding or pretokenised word encoding. Defaults to 'bpe' (Byte Pair Encoding). Passed on to sentencepiece.

args: a list of arguments passed on to sentencepiece

...: arguments passed on to word2vec
Value

an object of class BPEembed, which is a list with elements:

model: a sentencepiece model as loaded with sentencepiece_load_model

embedding: a matrix with embeddings as loaded with read.wordvectors

dim: the dimension of the embedding

n: the number of elements in the vocabulary

file_sentencepiece: the sentencepiece model file

file_word2vec: the word2vec embedding file
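The components listed above can be inspected directly on a fitted object. A minimal sketch, assuming `model` was built with BPEembedder as shown in the Examples section:

```r
## Inspect the components of a BPEembed object
## (assumes `model` is the result of BPEembedder(...) as in the Examples)
str(model, max.level = 1)          # list with model, embedding, dim, n, file_* elements
model$dim                          # dimension of the embedding space
model$n                            # size of the sentencepiece vocabulary
dim(model$embedding)               # embedding matrix: n rows, dim columns
model$file_sentencepiece           # path to the saved sentencepiece model
model$file_word2vec                # path to the saved word2vec embedding file
```

Because the model and embedding files are stored on disk, keeping these paths allows the object to be rebuilt later without retraining.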
See Also

sentencepiece, word2vec, predict.BPEembed
Examples
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language %in% "dutch")
model <- BPEembedder(x, tokenizer = "bpe", args = list(vocab_size = 1000),
                     type = "cbow", dim = 20, iter = 10)
model
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.")
values <- predict(model, txt, type = "encode")
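The encode step returns, for each input text, a matrix with one embedding row per subword token. A hedged sketch of round-tripping through the model, assuming predict.BPEembed also accepts type = "decode" (mapping embeddings back to text) and type = "tokenize" (showing the subword segmentation), as suggested by the predict.BPEembed help page:

```r
## Inspect the encoded result: one matrix per input text,
## with one row per subword token and `model$dim` columns
dim(values[[1]])

## Show how the sentencepiece model segments the text into subwords
## (assumes type = "tokenize" is supported by predict.BPEembed)
tokens <- predict(model, txt, type = "tokenize")
tokens

## Map the embeddings back to text via the nearest subwords
## (assumes type = "decode" is supported by predict.BPEembed)
decoded <- predict(model, values, type = "decode")
decoded
```

Note that decoding recovers the subword sequence closest to each embedding row, so the round trip is only exact when the embeddings are unmodified.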