R: Class ngram

ngram-class {ngram}

R Documentation

Class ngram

Description

An n-gram is an ordered sequence of n "words" taken from a body of "text". The terms "words" and "text" can be interpreted literally or more loosely.

Details

For example, consider the sequence "A B A C A B B". The 2-grams (or bigrams) of this sequence are

A B, B A, A C, C A, A B, B B

or without repetition:

A B, B A, A C, C A, B B

That is, we take the input string and group the "words" 2 at a time (because n=2). The number of n-grams and the number of words are related, though not obviously so; counting repetition, the number of n-grams is equal to

nwords - n + 1
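
As a minimal sketch of the grouping described above, the following base R code rebuilds the bigrams of the example sequence and checks the count. It does not use the ngram package, and the variable names (words, count, bigrams) are illustrative assumptions, not part of the package.

```r
# Split the example sequence into "words" on single spaces.
words <- strsplit("A B A C A B B", " ", fixed = TRUE)[[1]]
n <- 2

# The sketch assumes the input has at least n words.
stopifnot(length(words) >= n)

# Number of n-grams counting repetition: nwords - n + 1.
count <- length(words) - n + 1

# Build each n-gram by pasting n consecutive words together.
bigrams <- vapply(
  seq_len(count),
  function(i) paste(words[i:(i + n - 1)], collapse = " "),
  character(1)
)

print(bigrams)          # all bigrams, repeats included
print(unique(bigrams))  # bigrams without repetition
```

Here unique() drops the second occurrence of "A B", which gives the shorter list shown above.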

Bounds on the number of n-grams without repetition depend heavily on the input. A correct but useless bound is

#ngrams = nwords - (#repeats - 1) - (n - 1)

An ngram object is an S4 class container that stores some basic summary information (e.g., n) and several external pointers. For information on how to construct an ngram object, see ngram.

Slots

str_ptr: A pointer to a copy of the original input string.
strlen: The length of the string.
n: The eponymous 'n' as in 'n-gram'.
ngl_ptr: A pointer to the processed list of n-grams.
ngsize: The length of the ngram list, or in other words, the number of unique n-grams in the input string.
sl_ptr: A pointer to the list of words from the input string.
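
The non-pointer slots can be inspected directly on a constructed object. The sketch below assumes the ngram package is installed, that ngram() accepts a string and an n argument, and that the slots are readable with the @ operator under the names listed above. The guard skips the example when the package is unavailable.

```r
# Run only if the ngram package is available (assumption: it may not be installed).
if (requireNamespace("ngram", quietly = TRUE)) {
  # Construct an ngram object from the example sequence with n = 2.
  ng <- ngram::ngram("A B A C A B B", n = 2)

  print(isS4(ng))    # ngram objects are S4 instances
  print(ng@n)        # the 'n' in 'n-gram'
  print(ng@ngsize)   # number of unique n-grams found
}
```

The pointer slots (str_ptr, ngl_ptr, sl_ptr) are external pointers, so their values are not meaningful to print from R.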