sbo_predictions {sbo} | R Documentation |
Stupid Back-off text predictions
Description
Train a text predictor via Stupid Back-off
Usage
sbo_predictor(object, ...)
predictor(object, ...)
## S3 method for class 'character'
sbo_predictor(
object,
N,
dict,
.preprocess = identity,
EOS = "",
lambda = 0.4,
L = 3L,
filtered = "<UNK>",
...
)
## S3 method for class 'sbo_kgram_freqs'
sbo_predictor(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)
## S3 method for class 'sbo_predtable'
sbo_predictor(object, ...)
sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)
predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)
## S3 method for class 'character'
sbo_predtable(
object,
lambda = 0.4,
L = 3L,
filtered = "<UNK>",
N,
dict,
.preprocess = identity,
EOS = "",
...
)
## S3 method for class 'sbo_kgram_freqs'
sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)
Arguments
object |
either a character vector or an object inheriting from class sbo_kgram_freqs or sbo_predtable (see the S3 methods listed under Usage). |
... |
further arguments passed to or from other methods. |
N |
a length one integer. Order 'N' of the N-gram model. |
dict |
a |
.preprocess |
a function for corpus preprocessing. For
more details see |
EOS |
a length one character vector. String listing End-Of-Sentence
characters. For more details see |
lambda |
a length one numeric. Penalization factor applied to scores at each back-off step of the Stupid Back-off algorithm. |
L |
a length one integer. Maximum number of next-word predictions for a given input (top scoring predictions are retained). |
filtered |
a character vector. Words to exclude from next-word predictions. The strings '<UNK>' and '<EOS>' are reserved keywords referring to the Unknown-Word and End-Of-Sentence tokens, respectively. |
Details
These functions are generics used to train a text predictor with Stupid Back-Off. The functions predictor() and predtable() are aliases for sbo_predictor() and sbo_predtable(), respectively.
The sbo_predictor data structure carries all information required for prediction in a compact and efficient (upon retrieval) way, by directly storing the top L next-word predictions for each k-gram prefix observed in the training corpus.
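The idea of storing the top L next-word predictions per prefix can be illustrated with a minimal base R sketch. This is not how sbo is implemented (the package relies on compiled C++ structures); the toy corpus and the helper names ngram_counts(), get_count(), sbo_score() and top_predictions() are hypothetical. The sketch assumes the usual Stupid Back-off scoring rule: use the relative frequency of the highest-order k-gram that was observed, multiplied by lambda once for every back-off step taken.

```r
# Toy, already tokenized corpus (hypothetical data, not shipped with sbo)
tokens <- c("i", "love", "you", "i", "love", "cats", "i", "love", "you")

# Count all k-grams in the token sequence; returns a named integer vector
ngram_counts <- function(tokens, k) {
  starts <- seq_len(max(length(tokens) - k + 1, 0))
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
    character(1)
  )
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# Look up a k-gram count, returning 0 for unseen k-grams
get_count <- function(tab, key) {
  if (key %in% names(tab)) tab[[key]] else 0L
}

# counts[[k]] holds the k-gram counts, here for a trigram model (N = 3)
counts <- lapply(1:3, function(k) ngram_counts(tokens, k))

# Stupid Back-off score of 'word' given the preceding words in 'prefix'
sbo_score <- function(word, prefix, counts, lambda = 0.4) {
  # Keep at most N - 1 words of context
  prefix <- tail(prefix, length(counts) - 1)
  penalty <- 1
  repeat {
    k <- length(prefix)
    if (k == 0) {
      # Base case: unigram relative frequency
      return(penalty * get_count(counts[[1]], word) / sum(counts[[1]]))
    }
    num <- get_count(counts[[k + 1]], paste(c(prefix, word), collapse = " "))
    if (num > 0) {
      den <- get_count(counts[[k]], paste(prefix, collapse = " "))
      return(penalty * num / den)
    }
    # Unseen k-gram: drop the oldest context word and penalize by lambda
    prefix <- prefix[-1]
    penalty <- penalty * lambda
  }
}

# Score every dictionary word and keep the L best for one prefix
top_predictions <- function(prefix, counts, L = 3L, lambda = 0.4) {
  vocab <- names(counts[[1]])
  scores <- vapply(vocab, sbo_score, numeric(1),
                   prefix = prefix, counts = counts, lambda = lambda)
  head(sort(scores, decreasing = TRUE), L)
}

# "you" (score 2/3) ranks first, followed by "cats" (score 1/3)
top_predictions(c("i", "love"), counts, L = 2L)
```

A prediction table in this spirit would simply cache the result of top_predictions() for every observed prefix, so that prediction reduces to a lookup.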
The sbo_predictor objects are meant for interactive use. If the training process is computationally heavy, one can store a "raw" version of the text predictor in a sbo_predtable class object, which can be safely saved to disk (e.g. with save()). The resulting object can be restored in another R session, and the corresponding sbo_predictor object can be rebuilt rapidly by calling the generic constructor sbo_predictor() again (see example below).
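As a hedged illustration of this round trip, the sketch below builds the prediction tables once, writes them to a temporary file, and rebuilds the predictor from the restored object. It assumes the sbo package is installed and uses the twitter_freqs example data that ships with it; the variable names tbl and path are arbitrary.

```r
library(sbo)  # assumes the sbo package is installed

# Build the "raw" prediction tables from the example k-gram frequencies
tbl <- sbo_predtable(twitter_freqs)

# Write the tables to disk; save() needs an explicit 'file' argument
path <- tempfile(fileext = ".rda")
save(tbl, file = path)

# Simulate a fresh session: drop the object, then restore it from disk
rm(tbl)
load(path)

# Rebuild the in-memory predictor from the restored tables and use it
p <- sbo_predictor(tbl)
predict(p, "i love")
```

Only the sbo_predtable object is written to disk here; the sbo_predictor is recreated from it after loading, which matches the workflow described above.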
The objects returned by sbo_predictor() and sbo_predtable() are of class sbo_predictor and sbo_predtable, respectively. The latter contains Stupid Back-Off prediction tables, storing next-word predictions for each k-gram prefix observed in the text, whereas the former is an external pointer to an equivalent (but processed) C++ structure.
Both objects have the following attributes:
- N: The order of the underlying N-gram model, "N".
- dict: The model dictionary.
- lambda: The penalization used in the Stupid Back-Off algorithm.
- L: The maximum number of next-word predictions for a given text input.
- .preprocess: The function used for text preprocessing.
- EOS: A length one character vector listing all (single character) end-of-sentence tokens.
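Since these are ordinary R attributes, they can presumably be inspected with base R's attr(). The sketch below assumes the sbo package is installed and uses the twitter_predtable example object that ships with it; no particular values are claimed for the output.

```r
library(sbo)  # assumes the sbo package is installed

tbl <- twitter_predtable

attr(tbl, "N")        # order of the underlying N-gram model
attr(tbl, "lambda")   # back-off penalization
attr(tbl, "L")        # maximum number of predictions per input
attr(tbl, "EOS")      # end-of-sentence characters

# The dictionary and the preprocessing function are stored as well
length(attr(tbl, "dict"))
is.function(attr(tbl, ".preprocess"))
```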
Value
A sbo_predictor object for sbo_predictor(), a sbo_predtable object for sbo_predtable().
Author(s)
Valerio Gherardi
See Also
Examples
# Train a text predictor directly from corpus
p <- sbo_predictor(twitter_train, N = 3, dict = max_size ~ 1000,
.preprocess = preprocess, EOS = ".?!:;")
# Train a text predictor from previously computed 'kgram_freqs' object
p <- sbo_predictor(twitter_freqs)
# Load a text predictor from a Stupid Back-Off prediction table
p <- sbo_predictor(twitter_predtable)
# Predict from Stupid Back-Off text predictor
p <- sbo_predictor(twitter_predtable)
predict(p, "i love")
# Build Stupid Back-Off prediction tables directly from corpus
t <- sbo_predtable(twitter_train, N = 3, dict = max_size ~ 1000,
.preprocess = preprocess, EOS = ".?!:;")
# Build Stupid Back-Off prediction tables from kgram_freqs object
t <- sbo_predtable(twitter_freqs)
## Not run:
# Save and reload a 'sbo_predtable' object with base::save()
save(t, file = "t.rda")
load("t.rda")
## End(Not run)