R: Language Model Perplexities

perplexity {kgrams}

R Documentation

Language Model Perplexities

Description

Compute language model perplexities on a test corpus.

Usage

perplexity(
  text,
  model,
  .preprocess = attr(model, ".preprocess"),
  .tknz_sent = attr(model, ".tknz_sent"),
  exp = TRUE,
  ...
)

## S3 method for class 'character'
perplexity(
  text,
  model,
  .preprocess = attr(model, ".preprocess"),
  .tknz_sent = attr(model, ".tknz_sent"),
  exp = TRUE,
  detailed = FALSE,
  ...
)

## S3 method for class 'connection'
perplexity(
  text,
  model,
  .preprocess = attr(model, ".preprocess"),
  .tknz_sent = attr(model, ".tknz_sent"),
  exp = TRUE,
  batch_size = Inf,
  ...
)

Arguments

`text`	a character vector or connection. Test corpus from which language model perplexity is computed.
`model`	an object of class `language_model`.
`.preprocess`	a function taking a character vector as input and returning a character vector as output. Preprocessing transformation applied to input before computing perplexity.
`.tknz_sent`	a function taking a character vector as input and returning a character vector as output. Optional sentence tokenization step applied before computing perplexity.
`exp`	`TRUE` or `FALSE`. If `TRUE`, returns the actual perplexity, i.e. the exponential of the cross-entropy per token; otherwise, returns its natural logarithm.
`...`	further arguments passed to or from other methods.
`detailed`	`TRUE` or `FALSE`. If `TRUE`, the output has a `"details"` attribute, which is a data-frame containing the cross-entropy of each individual sentence tokenized from `text`.
`batch_size`	a length one positive integer or `Inf`. Size of text batches when reading text from a `connection`. If `Inf`, all input text is processed in a single batch.
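As a minimal sketch of how the `exp` and `detailed` arguments might be used (assuming the kgrams package is installed; the bigram model and the names `ppl`, `log_ppl` and `res` are illustrative choices, not part of the original examples):

```r
library(kgrams)

# Illustrative bigram model with Kneser-Ney smoothing, trained on "Much Ado About Nothing"
f <- kgram_freqs(much_ado, 2, .preprocess = preprocess, .tknz_sent = tknz_sent)
m <- language_model(f, "kn", D = 0.75)

# Perplexity and its natural logarithm on the test corpus
ppl <- perplexity(midsummer, m)
log_ppl <- perplexity(midsummer, m, exp = FALSE)

# Per the description of 'exp', these two quantities should agree
all.equal(log(ppl), log_ppl)

# With detailed = TRUE the result carries a "details" data frame attribute
res <- perplexity(midsummer, m, detailed = TRUE)
str(attr(res, "details"))
```

The column layout of the `"details"` data frame is not specified here, so the sketch only inspects it with `str()`.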

Details

These generic functions compute the perplexity of a language_model on a test corpus, which may be either a plain character vector of text or a connection from which text can be read in batches. The second option is useful to avoid loading the full text into physical memory, and makes it possible to process text from different sources such as files, compressed files or URLs.
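The connection interface could be exercised roughly as follows. This is a sketch under stated assumptions: the kgrams package is installed, the temporary file stands in for a real corpus on disk, and the batch size of 500 lines is arbitrary. Whether `perplexity()` closes the connection itself is not stated on this page, so the sketch closes it defensively.

```r
library(kgrams)

# Illustrative bigram model, as in the package examples but with N = 2
f <- kgram_freqs(much_ado, 2, .preprocess = preprocess, .tknz_sent = tknz_sent)
m <- language_model(f, "kn", D = 0.75)

# Write the test corpus to a temporary file (stand-in for a large corpus on disk)
path <- tempfile(fileext = ".txt")
writeLines(midsummer, path)

# Open the file in read-only mode and process it 500 lines at a time
con <- file(path, open = "r")
ppl_file <- perplexity(con, m, batch_size = 500L)

# Close the connection if perplexity() has not already done so (assumption:
# the closing behaviour is not documented here, hence the defensive tryCatch)
tryCatch(close(con), error = function(e) NULL)
unlink(path)

ppl_file
```

Since sentence tokenization is applied batch by batch, the result need not coincide exactly with `perplexity(midsummer, m)`: sentences that straddle a batch boundary may be split differently.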

"Perplexity" is defined here, following Chen and Goodman (1999), as the exponential of the normalized cross-entropy of the language model on the test corpus. Cross-entropy is normalized by the total number of words in the corpus, where the End-Of-Sentence tokens, but not the Begin-Of-Sentence tokens, are included in the word count.
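To make the definition concrete, here is a small base R sketch that does not use kgrams at all. The token probabilities are made-up numbers standing in for the conditional probabilities a model would assign, and the names `token_probs`, `n_tokens` and `cross_entropy` are illustrative only.

```r
# Hypothetical conditional probabilities assigned by some model to the tokens
# of two sentences. The last entry of each vector is the probability of the
# End-Of-Sentence token; Begin-Of-Sentence padding serves only as context and
# therefore has no entry.
token_probs <- list(
  c(0.25, 0.5, 0.5),  # e.g. "hello world <EOS>"
  c(0.125, 0.5)       # e.g. "bye <EOS>"
)

# Normalization: words plus End-Of-Sentence tokens, no Begin-Of-Sentence tokens
n_tokens <- sum(lengths(token_probs))

# Cross-entropy per token, in nats
cross_entropy <- -sum(log(unlist(token_probs))) / n_tokens

# Perplexity is the exponential of the normalized cross-entropy
perplexity_value <- exp(cross_entropy)
perplexity_value
```

In this toy case the product of all probabilities is 2^-8 over 5 tokens, so the perplexity equals 2^(8/5).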

The custom .preprocess and .tknz_sent arguments allow transformations to be applied to the text corpus before the perplexity computation takes place. By default, the same functions used during model building are employed, cf. kgram_freqs and language_model.
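The following sketch illustrates the defaults and an override, assuming the kgrams package is installed; the choice of `identity` as an alternative preprocessing function is only for illustration.

```r
library(kgrams)

# Model built with the package's preprocess() and tknz_sent() functions
f <- kgram_freqs(much_ado, 2, .preprocess = preprocess, .tknz_sent = tknz_sent)
m <- language_model(f, "kn", D = 0.75)

# Default call: transformations are taken from the model's attributes
ppl_default <- perplexity(midsummer, m)

# Equivalent call with the defaults spelled out, as in the Usage section
ppl_explicit <- perplexity(
  midsummer, m,
  .preprocess = attr(m, ".preprocess"),
  .tknz_sent = attr(m, ".tknz_sent")
)

# Override: skip preprocessing of the test corpus (sentence tokenization still
# applies), which will generally change the resulting perplexity
ppl_raw <- perplexity(midsummer, m, .preprocess = identity)

c(default = ppl_default, explicit = ppl_explicit, raw = ppl_raw)
```

Overriding the defaults makes the test text pass through a different pipeline than the training text, so the result should be interpreted with care.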

A note of caution is in order. Perplexity is not defined for all language models available in kgrams. For instance, smoother "sbo" (i.e. Stupid Backoff (Brants et al. 2007)) does not produce normalized probabilities, and this is signaled by a warning (shown once per session) if the user attempts to compute the perplexity for such a model. In these cases, when possible, perplexity computations are performed anyway, as the results might still be useful (e.g. to tune the model's parameters), even if their probabilistic interpretation no longer holds.
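As a hedged illustration of using such a model anyway for parameter tuning, the sketch below compares a few values of the Stupid Backoff penalization. It assumes the kgrams package is installed and that the "sbo" smoother exposes a parameter named `lambda`; the grid of values and the helper `ppl_for_lambda` are arbitrary, illustrative choices.

```r
library(kgrams)

f <- kgram_freqs(much_ado, 3, .preprocess = preprocess, .tknz_sent = tknz_sent)

# Stupid Backoff model; 'lambda' is assumed to be the backoff penalization
m_sbo <- language_model(f, "sbo", lambda = 0.4)

# A warning about the lack of a probabilistic interpretation may be emitted
# the first time this is called in a session
ppl_sbo <- perplexity(midsummer, m_sbo)

# Hypothetical helper: set lambda on a copy of the model and evaluate it
ppl_for_lambda <- function(lambda, model, text) {
  param(model, "lambda") <- lambda
  perplexity(text, model)
}

ppl_by_lambda <- sapply(
  c("lambda = 0.2" = 0.2, "lambda = 0.4" = 0.4, "lambda = 0.6" = 0.6),
  ppl_for_lambda, model = m_sbo, text = midsummer
)
ppl_by_lambda
```

The resulting numbers are useful only for relative comparison between parameter values, not as perplexities in the probabilistic sense.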

Value

a number. Perplexity of the language model on the test corpus.

Author(s)

Valerio Gherardi

References

Brants T, Popat AC, Xu P, Och FJ, Dean J (2007). “Large Language Models in Machine Translation.” In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 858–867. https://aclanthology.org/D07-1090/.

Chen SF, Goodman J (1999). “An empirical study of smoothing techniques for language modeling.” Computer Speech & Language, 13(4), 359–394.

Examples

# Train 4-, 6-, and 8-gram models on Shakespeare's "Much Ado About Nothing",
# compute their perplexities on the training and test corpora.
# We use Shakespeare's "A Midsummer Night's Dream" as test.


train <- much_ado
test <- midsummer

tknz <- function(text) tknz_sent(text, keep_first = TRUE)
f <- kgram_freqs(train, 8, .tknz_sent = tknz)
m <- language_model(f, "kn", D = 0.75)

# Compute perplexities for 4-, 6-, and 8-gram models
FUN <- function(N) {
  # Set the model order on a local copy of the model
  param(m, "N") <- N
  c(train = perplexity(train, m), test = perplexity(test, m))
}
sapply(c("N = 4" = 4, "N = 6" = 6, "N = 8" = 8), FUN)

[Package kgrams version 0.2.0 Index]