BPEembed {sentencepiece}    R Documentation

Tokenise and embed text using a Sentencepiece and Word2vec model

Description

Use a sentencepiece model to tokenise text and get the word2vec embeddings of these tokens

Usage

BPEembed(
  file_sentencepiece = x$file_model,
  file_word2vec = x$glove.bin$file_model,
  x,
  normalize = TRUE
)

Arguments

file_sentencepiece

the path to the file containing the sentencepiece model

file_word2vec

the path to the file containing the word2vec embeddings

x

the result of a call to sentencepiece_download_model. If x is provided, file_sentencepiece and file_word2vec default to the model files contained in x (x$file_model and x$glove.bin$file_model), so they do not need to be given (see the sketch under Details).

normalize

passed on to read.wordvectors to read in file_word2vec. Defaults to TRUE.
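
Details

When x is given, the defaults shown under Usage pick up the model files from x, so no file paths have to be supplied. Below is a minimal sketch of this, not run here as it requires internet access; the language, vocab_size and dim arguments passed to sentencepiece_download_model are assumptions, see that function's help page for the exact interface.

## not run: downloads a BPEmb model and its embeddings (arguments are assumptions)
dl      <- sentencepiece_download_model("dutch", vocab_size = 1000, dim = 25)
encoder <- BPEembed(x = dl)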

Value

an object of class BPEembed: a list which contains, amongst other elements, the loaded sentencepiece model and the matrix of word2vec embeddings

See Also

predict.BPEembed, sentencepiece_load_model, sentencepiece_download_model, read.wordvectors

Examples

##
## Example loading model from disk
##
folder    <- system.file(package = "sentencepiece", "models")
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.bin")
model     <- file.path(folder, "nl.wiki.bpe.vs1000.model")
encoder   <- BPEembed(model, embedding)  
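
## Inspect the structure of the returned object (the list described under Value)
str(encoder, max.level = 1)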

## Do tokenisation with the sentencepiece model + embed these
txt    <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
            "On est d'accord sur le prix de la biere?")
values <- predict(encoder, txt, type = "encode")  
str(values) 
values

txt <- rownames(values[[1]])
predict(encoder, txt, type = "decode") 
txt <- lapply(values, FUN = rownames) 
predict(encoder, txt, type = "decode") 
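
##
## Example reading in the embeddings without normalisation
## (normalize is passed on to read.wordvectors; setting it to FALSE is a
##  sketch which keeps the embedding values as stored in the word2vec file)
##
encoder_raw <- BPEembed(model, embedding, normalize = FALSE)
predict(encoder_raw, "De eigendomsoverdracht aan de deelstaten is ingewikkeld.", type = "encode")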

[Package sentencepiece version 0.2.3 Index]