BPEembed {sentencepiece}    R Documentation

Tokenise and embed text using a Sentencepiece and Word2vec model

Description

Use a sentencepiece model to tokenise text and get the word2vec embeddings of these tokens

Usage

BPEembed(
  file_sentencepiece = x$file_model,
  file_word2vec = x$glove.bin$file_model,
  x,
  normalize = TRUE
)

Arguments

file_sentencepiece

the path to the file containing the sentencepiece model

file_word2vec

the path to the file containing the word2vec embeddings

x

the result of a call to sentencepiece_download_model. If x is provided, file_sentencepiece and file_word2vec default to the model files contained in x (x$file_model and x$glove.bin$file_model), so they do not need to be given (see the sketch under Details).

normalize

passed on to read.wordvectors to read in file_word2vec. Defaults to TRUE.
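
Details

When x is given, the defaults shown under Usage pick up the model files from x, so no file paths have to be supplied. Below is a minimal sketch of this, not run here as it requires internet access; the language, vocab_size and dim arguments passed to sentencepiece_download_model are assumptions, see that function's help page for the exact interface.

## not run: downloads a BPEmb model and its embeddings (arguments are assumptions)
dl      <- sentencepiece_download_model("dutch", vocab_size = 1000, dim = 25)
encoder <- BPEembed(x = dl)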

Value

an object of class BPEembed: a list which contains, amongst other elements, the loaded sentencepiece model and the matrix of word2vec embeddings

See Also

predict.BPEembed, sentencepiece_load_model, sentencepiece_download_model, read.wordvectors

Examples

##
## Example loading model from disk
##
folder    <- system.file(package = "sentencepiece", "models")
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.bin")
model     <- file.path(folder, "nl.wiki.bpe.vs1000.model")
encoder   <- BPEembed(model, embedding)  
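
## Inspect the structure of the returned object (the list described under Value)
str(encoder, max.level = 1)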

## Do tokenisation with the sentencepiece model + embed these
txt    <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
            "On est d'accord sur le prix de la biere?")
values <- predict(encoder, txt, type = "encode")  
str(values) 
values

txt <- rownames(values[[1]])
predict(encoder, txt, type = "decode") 
txt <- lapply(values, FUN = rownames) 
predict(encoder, txt, type = "decode") 
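
##
## Example reading in the embeddings without normalisation
## (normalize is passed on to read.wordvectors; setting it to FALSE is a
##  sketch which keeps the embedding values as stored in the word2vec file)
##
encoder_raw <- BPEembed(model, embedding, normalize = FALSE)
predict(encoder_raw, "De eigendomsoverdracht aan de deelstaten is ingewikkeld.", type = "encode")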

[Package sentencepiece version 0.2.3 Index]