sentencepiece_encode {sentencepiece}		R Documentation
Tokenise text with a Sentencepiece model
Description
Tokenise text with a Sentencepiece model
Usage
sentencepiece_encode(
model,
x,
type = c("subwords", "ids"),
nbest = -1L,
alpha = 0.1
)
Arguments
model: an object of class sentencepiece as returned by sentencepiece_load_model

x: a character vector of text (in UTF-8 encoding)

type: a character string, either 'subwords' or 'ids', to get the subwords or the corresponding ids of these subwords as defined in the vocabulary of the model. Defaults to 'subwords'.

nbest: integer indicating the number of segmentations to extract. See the details. The argument is not used if you do not provide a value for it.

alpha: smoothing parameter to perform subword regularisation. Typical values are 0.1, 0.2 or 0.5. See the details. The argument is not used if you do not provide a value for it or if you do not provide a value for nbest.
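As an illustration of the 'ids' type, a minimal sketch mapping ids back to their subwords. It assumes the loaded model exposes its vocabulary as a data.frame with columns id and subword, and that ids are 0-based as in SentencePiece; both are assumptions about the object layout, not guarantees of this help page.

model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)
ids <- sentencepiece_encode(model, x = "Goed zo", type = "ids")[[1]]
# assumed: $vocabulary holds the id-to-subword mapping; ids are 0-based, hence + 1
model$vocabulary[ids + 1, ]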
Details
If you specify alpha to perform subword regularisation, keep the following in mind (see the sketch after this list):

- When alpha is 0.0, one segmentation is sampled uniformly from the nbest candidates or the lattice. Larger alpha values, such as 0.1, make the best Viterbi segmentation more likely to be sampled.

- If you provide a positive value for nbest, one segmentation is sampled from approximately nbest candidates.

- If you provide a negative value for nbest, one segmentation is sampled from all hypotheses (the lattice) according to the generation probabilities, using the forward-filtering and backward-sampling algorithm.

- nbest and alpha correspond respectively to the parameters l and alpha in the paper https://arxiv.org/abs/1804.10959 (nbest < 0 means l = infinity).

- If the model is a BPE model, alpha is the merge probability p explained in https://arxiv.org/abs/1910.13267. In a BPE model, nbest-based sampling is not supported, so the value of nbest is ignored, although it still needs to be provided if you want to make use of alpha.
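A minimal sketch of the sampling behaviour, using the unigram model shipped with the package: with alpha > 0, repeated calls may return different segmentations of the same text.

model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)
# repeated calls can sample different segmentations when alpha > 0
for (i in 1:3) {
  print(sentencepiece_encode(model, x = "Goed zo", type = "subwords",
                             nbest = -1, alpha = 0.1))
}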
Value
a list of tokenised texts, one for each element of x, unless you provide nbest without providing alpha, in which case the result is a list of lists of nbest tokenised texts
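A minimal sketch contrasting the two return shapes, re-using the unigram model shipped with the package:

model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)
# one character vector of subwords per element of x
str(sentencepiece_encode(model, x = "Goed zo", type = "subwords"))
# nbest without alpha: a list of lists of candidate segmentations
str(sentencepiece_encode(model, x = "Goed zo", type = "subwords", nbest = 2))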
Examples
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
"On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")
## Examples using subword regularisation
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)
txt <- c("Goed zo",
"On est d'accord")
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 4)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 4)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 2)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 2)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 4, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 4, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = -1, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = -1, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = -1, alpha = 0)
sentencepiece_encode(model, x = txt, type = "ids", nbest = -1, alpha = 0)
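## Sketch: mapping the output back to text. This assumes the package's
## companion decoder sentencepiece_decode accepts the output of
## sentencepiece_encode for both subwords and ids.
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)
x <- sentencepiece_encode(model, x = "Goed zo", type = "subwords")
sentencepiece_decode(model, x)
x <- sentencepiece_encode(model, x = "Goed zo", type = "ids")
sentencepiece_decode(model, x)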