| ngram-class {ngram} | R Documentation |
Class ngram
Description
An n-gram is an ordered sequence of n "words" taken from a body of "text". The terms "words" and "text" can be interpreted literally or more loosely.
Details
For example, consider the sequence "A B A C A B B". If we examine the 2-grams (or bigrams) of this sequence, they are
A B, B A, A C, C A, A B, B B
or without repetition:
A B, B A, A C, C A, B B
That is, we take the input string and group the "words" 2 at a time (because
n=2). The number of n-grams is simply related to the number of words only
when repetitions are counted; in that case, the number of n-grams is equal
to
nwords - n + 1
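The grouping and the count above can be reproduced with a minimal base R sketch. The helper word_ngrams is hypothetical and is not part of the ngram package; it assumes the words are separated by single spaces.

```r
# Hypothetical helper (not part of the ngram package): split on single
# spaces and slide a window of n words across the result.
word_ngrams <- function(text, n) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  count <- length(words) - n + 1            # nwords - n + 1
  if (count < 1) return(character(0))       # fewer than n words: no n-grams
  vapply(seq_len(count),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- word_ngrams("A B A C A B B", 2)
bigrams          # all bigrams, in order of appearance
unique(bigrams)  # the same list without repetition
length(bigrams)  # 7 - 2 + 1
```

The package itself stores its n-grams behind external pointers, so this sketch only illustrates the definition, not how the package computes it.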
Bounds ignoring repetition are highly dependent on the input. A correct but useless bound is
#ngrams = nwords - (#repeats - 1) - (n - 1)
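To see how strongly the count without repetition depends on the input, the following base R sketch compares three 7-word strings. The helper count_unique_ngrams is hypothetical (not part of the ngram package) and again assumes single-space separators.

```r
# Hypothetical helper: number of distinct n-grams in a space-separated string.
count_unique_ngrams <- function(text, n) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  count <- length(words) - n + 1
  if (count < 1) return(0L)
  grams <- vapply(seq_len(count),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  length(unique(grams))
}

all_same <- count_unique_ngrams("A A A A A A A", 2)  # every bigram is "A A"
all_diff <- count_unique_ngrams("A B C D E F G", 2)  # no bigram repeats
example  <- count_unique_ngrams("A B A C A B B", 2)  # only "A B" repeats
c(all_same, all_diff, example)
```

All three inputs have 6 bigrams when repetition is counted, yet the number of unique bigrams differs for each.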
An ngram object is an S4 class container that stores some basic
summary information (e.g., n), and several external pointers. For
information on how to construct an ngram object, see
ngram.
Slots
str_ptr: A pointer to a copy of the original input string.
strlen: The length of the string.
n: The eponymous 'n' as in 'n-gram'.
ngl_ptr: A pointer to the processed list of n-grams.
ngsize: The length of the ngram list, or in other words, the number of unique n-grams in the input string.
sl_ptr: A pointer to the list of words from the input string.