R: Weka-like n-gram Tokenization

Tokenize-AsWeka {ngram}

R Documentation

Weka-like n-gram Tokenization

Description

An n-gram tokenizer with identical output to the NGramTokenizer function from the RWeka package.

Usage

ngram_asweka(str, min = 2, max = 2, sep = " ")

Arguments

`str`	The input text.
`min`, `max`	The minimum and maximum 'n' as in 'n-gram'.
`sep`	A set of separator characters that delimit the "words". See Details for how this works; it behaves a little differently from the `sep` arguments of other R functions.

Details

This n-gram tokenizer behaves similarly to the tokenizer in RWeka, in both its input and its return value. Unlike the tokenizer ngram(), it does not return a special class of external pointers; it returns an ordinary vector, which can therefore be serialized via save() or saveRDS().
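Because the result is a plain vector, a save-and-restore round trip should work. The following is a minimal sketch of that, assuming the ngram package is installed; the temporary file and the variable names are chosen only for illustration.

```r
# Round-trip sketch: serialize the tokenizer output and read it back.
# Assumption: the ngram package is installed; otherwise the block is skipped.
if (requireNamespace("ngram", quietly = TRUE)) {
  tokens <- ngram::ngram_asweka("A B A C A B B", min = 2, max = 3)

  path <- tempfile(fileext = ".rds")  # illustrative temporary file
  saveRDS(tokens, path)               # write the vector to disk
  restored <- readRDS(path)           # read it back in

  # The restored object should be identical to the original vector.
  stopifnot(identical(tokens, restored))
  unlink(path)                        # clean up the temporary file
}
```

The same round trip is not expected to work for the object returned by ngram(), since that object holds external pointers.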

Value

A character vector of n-grams, grouped in blocks of decreasing n (from max down to min); within each block, the n-grams appear in the order in which they occur in the input. The output matches that of RWeka's n-gram tokenizer.
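To make the ordering concrete, here is a minimal base R sketch of one way to read that description. The function ngram_blocks_sketch() is hypothetical and is not part of the ngram package. It also assumes that `sep` is a single literal separator string, which may differ from how ngram_asweka() actually treats `sep`.

```r
# Hypothetical helper (not from the ngram package): builds n-grams in blocks
# of decreasing n, keeping input order within each block.
ngram_blocks_sketch <- function(str, min = 2, max = 2, sep = " ") {
  stopifnot(min >= 1, max >= min)

  # Assumption: `sep` is one literal separator string; drop empty pieces.
  words <- strsplit(str, sep, fixed = TRUE)[[1]]
  words <- words[nzchar(words)]

  out <- character(0)
  for (n in seq(max, min, by = -1)) {
    if (n > length(words)) next  # not enough words for an n-gram of this size
    starts <- seq_len(length(words) - n + 1)
    block <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, block)  # append the block for this n
  }
  out
}

tokens <- ngram_blocks_sketch("A B A C A B B", min = 2, max = 4)
print(tokens)
```

For this input the sketch yields four 4-grams, then five 3-grams, then six 2-grams, 15 strings in total, beginning with "A B A C" and ending with "B B".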

Examples

library(ngram)

# Tokenize the input into 2-, 3-, and 4-grams
str <- "A B A C A B B"
ngram_asweka(str, min = 2, max = 4)

[Package ngram version 3.2.3 Index]