bleu_corpus {sacRebleu}    R Documentation

Tokenize with 'tok' and compute the corpus-level BLEU score

Description

This function tokenizes the input with the 'tok' library and computes the BLEU score. The 'tokenizer' argument accepts either an already initialized tokenizer or a valid Hugging Face identifier (string). If only an identifier is passed, a new tokenizer is initialized on every call.

Usage

bleu_corpus(
  references,
  candidates,
  tokenizer = "bert-base-cased",
  n = 4,
  weights = NULL,
  smoothing = NULL,
  epsilon = 0.1,
  k = 1
)

Arguments

references

A list of lists of reference sentences, one inner list per candidate ('list(list("ref 1a", "ref 1b"), list("ref 2a", "ref 2b"))').

candidates

A list of candidate sentences ('list("candidate 1", "candidate 2")').

tokenizer

Either an already initialized 'tok' tokenizer object or a Hugging Face identifier (default is 'bert-base-cased').

n

Maximum n-gram order for the BLEU score (default is set to 4).

weights

Weights for the n-grams (default is set to 1/n for each entry).

smoothing

Smoothing method for the BLEU score (default is set to 'standard'; 'floor' and 'add-k' are also available).

epsilon

Epsilon value for epsilon smoothing, i.e. the 'floor' method (default is set to 0.1).

k

K value for add-k-smoothing (default is set to 1).
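
To illustrate how 'n', 'weights', 'smoothing', 'epsilon' and 'k' typically interact, here is a minimal base R sketch of the textbook BLEU combination step. It is not the package's implementation: 'combine_bleu' and its inputs (per-order match counts, per-order totals, and the two lengths) are hypothetical names introduced only for this example, and the smoothing variants follow one common formulation that may differ in detail from what sacRebleu does.

```r
# Hypothetical helper, NOT part of sacRebleu: combines clipped n-gram match
# counts into a BLEU score using the standard weighted geometric mean.
# Assumes all entries of `totals` are positive and cand_len > 0.
combine_bleu <- function(matches, totals, cand_len, ref_len,
                         weights = NULL, smoothing = "standard",
                         epsilon = 0.1, k = 1) {
  stopifnot(length(matches) == length(totals), all(totals > 0))
  n <- length(matches)                     # highest n-gram order
  if (is.null(weights)) {
    weights <- rep(1 / n, n)               # uniform weights: 1/n per order
  }

  if (smoothing == "floor") {
    # Assumed formulation: zero match counts are replaced by epsilon
    matches <- ifelse(matches == 0, epsilon, matches)
  } else if (smoothing == "add-k") {
    # Assumed formulation: add k to numerator and denominator for orders > 1
    higher <- seq_len(n) > 1
    matches[higher] <- matches[higher] + k
    totals[higher] <- totals[higher] + k
  }

  precisions <- matches / totals
  if (any(precisions == 0)) {
    return(0)                              # an unsmoothed zero precision gives 0
  }

  # Brevity penalty: penalize candidates shorter than the references
  bp <- if (cand_len > ref_len) 1 else exp(1 - ref_len / cand_len)
  bp * exp(sum(weights * log(precisions)))
}

# Example counts for orders 1 to 4 with no 4-gram match
matches <- c(3, 2, 1, 0)
totals <- c(4, 3, 2, 1)
unsmoothed <- combine_bleu(matches, totals, cand_len = 4, ref_len = 4)
floored <- combine_bleu(matches, totals, cand_len = 4, ref_len = 4,
                        smoothing = "floor", epsilon = 0.1)
```

In this sketch the unsmoothed score collapses to zero as soon as one n-gram order has no match, which is the situation smoothing is meant to soften for short texts.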

Value

The BLEU score for the candidate corpus.

Examples

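# One candidate sentence per entry; each candidate has its own list of references
cand_corpus <- list("This is good", "This is not good")
ref_corpus <- list(list("Perfect outcome!", "Excellent!"), list("Not sufficient.", "Horrible."))

# Initialize the tokenizer once so it can be reused across calls
tok <- tok::tokenizer$from_pretrained("bert-base-uncased")
# Store the result under a name that differs from the function name
bleu_score <- bleu_corpus(ref_corpus, cand_corpus, tok)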
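
The two ways of supplying a tokenizer described above can be compared side by side. The following is a sketch that uses only the arguments listed under Usage; it assumes the sacRebleu and tok packages are installed and that the 'bert-base-cased' tokenizer can be downloaded from Hugging Face. The variable names are illustrative.

```r
library(sacRebleu)

cand_corpus <- list("This is good", "This is not good")
ref_corpus <- list(list("Perfect outcome!", "Excellent!"),
                   list("Not sufficient.", "Horrible."))

# Option 1: pass an identifier; a tokenizer is initialized inside the call
score_id <- bleu_corpus(ref_corpus, cand_corpus, tokenizer = "bert-base-cased")

# Option 2: initialize the tokenizer once and reuse it for several calls
tok <- tok::tokenizer$from_pretrained("bert-base-cased")
score_tok <- bleu_corpus(ref_corpus, cand_corpus, tokenizer = tok)

# The same tokenizer object reused with a non-default smoothing method
score_floor <- bleu_corpus(ref_corpus, cand_corpus, tokenizer = tok,
                           smoothing = "floor", epsilon = 0.1)
```

Reusing an initialized tokenizer is likely the better choice when the function is called repeatedly, for example in an evaluation loop, because it avoids initializing the tokenizer on every call.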

[Package sacRebleu version 0.1.3 Index]