ngram_tokenize {SentimentAnalysis} — R Documentation
N-gram tokenizer
Description
A tokenizer for use with a document-term matrix from the tm package. Supports both character and word n-grams, and includes its own wrapper to handle non-Latin encodings.
Usage
ngram_tokenize(x, char = FALSE, ngmin = 1, ngmax = 3)
Arguments
x: input string

char: logical value specifying whether to use character (char = TRUE) or word n-grams (char = FALSE, the default)

ngmin: integer giving the minimum order of n-gram (default: 1)

ngmax: integer giving the maximum order of n-gram (default: 3)
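How the ngmin/ngmax arguments interact with the char switch can be sketched as follows. This is a language-agnostic illustration in Python, not the package's R implementation; the function name ngram_tokenize_sketch is hypothetical, and the word splitter is a simple whitespace split rather than whatever tokenization the package actually applies.

```python
def ngram_tokenize_sketch(x, char=False, ngmin=1, ngmax=3):
    # Hypothetical sketch of n-gram tokenization, not the package's code.
    # Units are single characters (char=True) or whitespace-split words.
    units = list(x) if char else x.split()
    sep = "" if char else " "
    grams = []
    # Emit every n-gram of order ngmin through ngmax, in order.
    for n in range(ngmin, ngmax + 1):
        for i in range(len(units) - n + 1):
            grams.append(sep.join(units[i:i + n]))
    return grams

# Character 1- and 2-grams of "ab": ["a", "b", "ab"]
print(ngram_tokenize_sketch("ab", char=True, ngmin=1, ngmax=2))

# Word bigrams of a sentence: ["Romeo loves", "loves Juliet"]
print(ngram_tokenize_sketch("Romeo loves Juliet", ngmin=2, ngmax=2))
```

Note that with ngmin < ngmax the lower-order grams are included as well, which is why the default (1, 3) produces unigrams, bigrams, and trigrams together.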
Examples
library(tm)
en <- c("Romeo loves Juliet", "Romeo loves a girl")
en.corpus <- VCorpus(VectorSource(en))
tdm <- TermDocumentMatrix(en.corpus,
                          control=list(wordLengths=c(1, Inf),
                                       tokenize=function(x) ngram_tokenize(x, char=TRUE,
                                                                           ngmin=3, ngmax=3)))
inspect(tdm)
ch <- c("abab", "aabb")
ch.corpus <- VCorpus(VectorSource(ch))
tdm <- TermDocumentMatrix(ch.corpus,
                          control=list(wordLengths=c(1, Inf),
                                       tokenize=function(x) ngram_tokenize(x, char=TRUE,
                                                                           ngmin=1, ngmax=2)))
inspect(tdm)
[Package SentimentAnalysis version 1.3-5 Index]