predict.BPEembed {sentencepiece} | R Documentation
Encode and Decode with a BPEembed model
Description
Use the sentencepiece model to do one of the following:
encode: tokenise the text and embed the resulting tokens
decode: recover the untokenised text from tokenised data
tokenize: only tokenise the text with the sentencepiece model, without embedding it
Usage
## S3 method for class 'BPEembed'
predict(object, newdata, type = c("encode", "decode", "tokenize"), ...)
Arguments
object: an object of class BPEembed as returned by BPEembed
newdata: a character vector of text to encode, a character vector of encoded tokens to decode, or a list of such character vectors
type: a character string, one of 'encode', 'decode' or 'tokenize'
...: further arguments passed on to the methods
Value
in case type is set to 'encode': a list of matrices containing the embeddings of the text, which is tokenised with sentencepiece_encode
in case type is set to 'decode': a character vector of decoded text as returned by sentencepiece_decode
in case type is set to 'tokenize': the tokenised text as returned by sentencepiece_encode
See Also
BPEembed, sentencepiece_decode, sentencepiece_encode
Examples
library(sentencepiece)

## Load the Dutch BPE model and the matching embeddings shipped with the package
embedding <- system.file(package = "sentencepiece", "models",
                         "nl.wiki.bpe.vs1000.d25.w2v.bin")
model     <- system.file(package = "sentencepiece", "models",
                         "nl.wiki.bpe.vs1000.model")
encoder   <- BPEembed(model, embedding)

## Encode: tokenise and embed the text
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
values <- predict(encoder, txt, type = "encode")
str(values)
values

## Decode: the row names of each matrix are the tokens
txt <- rownames(values[[1]])
predict(encoder, txt, type = "decode")
## Decode a list of token vectors, one per text
txt <- lapply(values, FUN = rownames)
predict(encoder, txt, type = "decode")

## Tokenize only; the extra unnamed argument is passed on through ...
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
predict(encoder, txt, type = "tokenize", "subwords")
predict(encoder, txt, type = "tokenize", "ids")
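With type = "encode" the result is a list with one matrix per text, and each matrix has one row per token. A common follow-up is to collapse each matrix into a single vector per text. The sketch below does this by averaging the token rows. The mean-pooling step and the name sentence_vectors are illustrative assumptions, not something the package provides; it assumes the bundled model files used in the Examples above are installed.

```r
library(sentencepiece)

# Same bundled Dutch model and embedding files as in the Examples above
embedding <- system.file(package = "sentencepiece", "models",
                         "nl.wiki.bpe.vs1000.d25.w2v.bin")
model     <- system.file(package = "sentencepiece", "models",
                         "nl.wiki.bpe.vs1000.model")
encoder   <- BPEembed(model, embedding)

txt    <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
            "On est d'accord sur le prix de la biere?")
values <- predict(encoder, txt, type = "encode")

# Mean-pool: average the token rows of each matrix, giving one vector per text.
# This pooling is an illustrative choice (assumption), not part of the package.
sentence_vectors <- do.call(rbind, lapply(values, colMeans))

# One row per input text, one column per embedding dimension
dim(sentence_vectors)
```

Averaging is only one way to pool the token embeddings; it ignores token order and gives every token the same weight.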