Tokenize-AsWeka {ngram} | R Documentation |
Weka-like n-gram Tokenization
Description
An n-gram tokenizer with identical output to the NGramTokenizer
function from the RWeka package.
Usage
ngram_asweka(str, min = 2, max = 2, sep = " ")
Arguments
str |
The input text. |
min, max |
The minimum and maximum 'n' as in 'n-gram'. |
sep |
A set of separator characters for the "words". See details for
information about how this works; it works a little differently
from |
Details
This n-gram tokenizer behaves similarly in both input and return to
the tokenizer in RWeka. Unlike the tokenizer ngram(), the return is
not a special class of external pointers; it is a vector, and
therefore can be serialized via save() or saveRDS().
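Because the return is a plain vector, it can be written to disk and read back unchanged. The following is a minimal sketch of such a round trip, assuming the ngram package is installed; the temporary file path and the variable names are illustrative only.

```r
library(ngram)

# Tokenize into bigrams; the result is an ordinary vector.
tokens <- ngram_asweka("A B A C A B B", min = 2, max = 2)

# Round-trip through a temporary file (the path is arbitrary).
path <- tempfile(fileext = ".rds")
saveRDS(tokens, path)
restored <- readRDS(path)

# Check that the tokens survived serialization unchanged.
identical(tokens, restored)

# Clean up the temporary file.
unlink(path)
```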
Value
A vector of n-grams grouped into blocks by n, with the largest n first; within each block, the n-grams appear in the order they occur in the input. The output matches that of RWeka's n-gram tokenizer.
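The ordering described here can be illustrated with a short base-R sketch. The function asweka_sketch() below is hypothetical and is not part of the ngram package; it assumes words are separated by single spaces and makes no attempt to reproduce the package's handling of sep.

```r
# Hypothetical helper, not part of the ngram package: a base-R sketch
# of the documented output ordering only.
asweka_sketch <- function(str, min = 2, max = 2) {
  # Assumption: words are separated by single spaces.
  words <- strsplit(str, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  # Largest n first, so the blocks appear in decreasing order of n.
  for (n in seq(max, min, by = -1)) {
    if (n > length(words)) next
    # Within a block, n-grams follow their order in the text.
    for (i in seq_len(length(words) - n + 1)) {
      out <- c(out, paste(words[i:(i + n - 1)], collapse = " "))
    }
  }
  out
}

grams <- asweka_sketch("A B A C A B B", min = 2, max = 4)
print(grams)
```

For this seven-word input the sketch produces 15 n-grams: four 4-grams, then five 3-grams, then six 2-grams.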
See Also
Examples
library(ngram)

# Input text with space-separated "words".
str = "A B A C A B B"

# Tokenize into 2-, 3-, and 4-grams.
ngram_asweka(str, min=2, max=4)
[Package ngram version 3.2.3 Index]