sentencepiece_encode {sentencepiece}		R Documentation
Tokenise text with a Sentencepiece model
Description
Tokenise text with a Sentencepiece model
Usage
sentencepiece_encode(
model,
x,
type = c("subwords", "ids"),
nbest = -1L,
alpha = 0.1
)
Arguments
model: an object of class sentencepiece as returned by sentencepiece_load_model

x: a character vector of text (in UTF-8 encoding)

type: a character string, either 'subwords' or 'ids', to get the subwords or the corresponding ids of these subwords as defined in the vocabulary of the model. Defaults to 'subwords'.

nbest: integer indicating the number of segmentations to extract. See the details. The argument is not used if you do not provide a value for it.

alpha: smoothing parameter to perform subword regularisation. Typical values are 0.1, 0.2 or 0.5. See the details. The argument is not used if you do not provide a value for it or if you do not provide a value for nbest.
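As an illustration of the 'ids' type, a minimal sketch mapping ids back to their subwords. It assumes the loaded model exposes its vocabulary as a data.frame with columns id and subword, and that ids are 0-based as in SentencePiece; both are assumptions about the object layout, not guarantees of this help page.

model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)
ids <- sentencepiece_encode(model, x = "Goed zo", type = "ids")[[1]]
# assumed: $vocabulary holds the id-to-subword mapping; ids are 0-based, hence + 1
model$vocabulary[ids + 1, ]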
Details
If you specify alpha to perform subword regularisation, keep the following in mind (see the sketch after this list):

- When alpha is 0.0, one segmentation is sampled uniformly from the nbest candidates or the lattice. Larger alpha values, such as 0.1, make the best Viterbi segmentation more likely to be sampled.

- If you provide a positive value for nbest, one segmentation is sampled from approximately nbest candidates.

- If you provide a negative value for nbest, one segmentation is sampled from all hypotheses (the lattice) according to the generation probabilities, using the forward-filtering and backward-sampling algorithm.

- nbest and alpha correspond respectively to the parameters l and alpha in the paper https://arxiv.org/abs/1804.10959 (nbest < 0 means l = infinity).

- If the model is a BPE model, alpha is the merge probability p explained in https://arxiv.org/abs/1910.13267. In a BPE model, nbest-based sampling is not supported, so the value of nbest is ignored, although it still needs to be provided if you want to make use of alpha.
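A minimal sketch of the sampling behaviour, using the unigram model shipped with the package: with alpha > 0, repeated calls may return different segmentations of the same text.

model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)
# repeated calls can sample different segmentations when alpha > 0
for (i in 1:3) {
  print(sentencepiece_encode(model, x = "Goed zo", type = "subwords",
                             nbest = -1, alpha = 0.1))
}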
Value
a list of tokenised texts, one for each element of x, unless you provide nbest without providing alpha, in which case the result is a list of lists of nbest tokenised texts
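A minimal sketch contrasting the two return shapes, re-using the unigram model shipped with the package:

model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)
# one character vector of subwords per element of x
str(sentencepiece_encode(model, x = "Goed zo", type = "subwords"))
# nbest without alpha: a list of lists of candidate segmentations
str(sentencepiece_encode(model, x = "Goed zo", type = "subwords", nbest = 2))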
Examples
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
"On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")
## Examples using subword regularisation
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)
txt <- c("Goed zo",
"On est d'accord")
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 4)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 4)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 2)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 2)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 4, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 4, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = -1, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = -1, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = -1, alpha = 0)
sentencepiece_encode(model, x = txt, type = "ids", nbest = -1, alpha = 0)
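## Sketch: mapping the output back to text. This assumes the package's
## companion decoder sentencepiece_decode accepts the output of
## sentencepiece_encode for both subwords and ids.
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)
x <- sentencepiece_encode(model, x = "Goed zo", type = "subwords")
sentencepiece_decode(model, x)
x <- sentencepiece_encode(model, x = "Goed zo", type = "ids")
sentencepiece_decode(model, x)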