ngram {ngram} | R Documentation |
n-gram Tokenization
Description
The ngram() function is the main workhorse of this package. It takes
an input string and converts it into the internal n-gram representation.
Usage
ngram(str, n = 2, sep = " ")
Arguments
str |
The input text. |
n |
The 'n' as in 'n-gram'. |
sep |
A set of separator characters for the "words". See details for
information about how this works; it works a little differently
from |
Details
On evaluation, a copy of the input string is produced and stored as an external pointer. This is necessary because the internal list representation just points to the first character of each word in the input string. Without the copy, if you (or R's gc) deleted the input string, those pointers would dangle and basically all hell would break loose.
The sep parameter splits at any of the characters in the string. So sep=", " splits at a comma or a space.
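To make the "any of the characters" rule concrete, here is a rough base R analogue. It is only a sketch of the splitting semantics described above, not the package's internal code: it uses strsplit() with a regular-expression character class, and the variable names (txt, sep, words) are made up for illustration. How ngram() itself treats consecutive separators is not covered here.

```r
# Base R analogue of a character-set separator (illustrative only; this is
# not how the ngram package tokenizes internally).
txt <- "A,B,A,C A B B"
sep <- ", "

# Wrap the separator characters in a regex character class, so that each
# character in `sep` acts as its own delimiter: "[, ]" matches "," or " ".
pattern <- paste0("[", sep, "]")
words <- strsplit(txt, pattern)[[1]]

print(words)
```

This simple pattern construction assumes sep contains no characters that are special inside a regex character class (such as "]", "^" or "\").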
Value
An ngram class object.
See Also
ngram-class, getters, phrasetable, babble
Examples
library(ngram)
str = "A B A C A B B"
ngram(str, n=2)
str = "A,B,A,C A B B"
### Split at a space
print(ngram(str), output="full")
### Split at a comma
print(ngram(str, sep=","), output="full")
### Split at a space or a comma
print(ngram(str, sep=", "), output="full")