BPEembed {sentencepiece}    R Documentation
Tokenise and embed text using a Sentencepiece and Word2vec model
Description
Use a sentencepiece model to tokenise text and get the embeddings of these tokens.
Usage
BPEembed(
  file_sentencepiece = x$file_model,
  file_word2vec = x$glove.bin$file_model,
  x,
  normalize = TRUE
)
Arguments
file_sentencepiece
    the path to the file containing the sentencepiece model
file_word2vec
    the path to the file containing the word2vec embeddings
x
    the result of a call to sentencepiece_download_model; only used to provide the defaults for file_sentencepiece and file_word2vec
normalize
    passed on to read.wordvectors when reading in file_word2vec. Defaults to TRUE.
Value
an object of class BPEembed, which is a list with elements:

model: a sentencepiece model as loaded with sentencepiece_load_model
embedding: a matrix with embeddings as loaded with read.wordvectors
dim: the dimension of the embedding
n: the number of elements in the vocabulary
file_sentencepiece: the sentencepiece model file
file_word2vec: the word2vec embedding file
See Also
predict.BPEembed, sentencepiece_load_model, sentencepiece_download_model, read.wordvectors
Examples
##
## Example loading model from disk
##
folder <- system.file(package = "sentencepiece", "models")
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.bin")
model <- file.path(folder, "nl.wiki.bpe.vs1000.model")
encoder <- BPEembed(model, embedding)
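
## Inspect the components of the returned BPEembed object
## (element names follow the list given under Value)
str(encoder, max.level = 1)
encoder$dim
encoder$n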
## Do tokenisation with the sentencepiece model + embed these
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
"On est d'accord sur le prix de la biere?")
values <- predict(encoder, txt, type = "encode")
str(values)
values
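
## Decode subword tokens back into text: predict(..., type = "decode")
## accepts either a character vector of tokens or a list of such vectors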
txt <- rownames(values[[1]])
predict(encoder, txt, type = "decode")
txt <- lapply(values, FUN = rownames)
predict(encoder, txt, type = "decode")