ngram_tokenize {SentimentAnalysis} — R Documentation
N-gram tokenizer
Description
A tokenizer for use with a document-term matrix from the tm package. Supports both character and word n-grams, and includes its own wrapper to handle non-Latin encodings.
Usage
ngram_tokenize(x, char = FALSE, ngmin = 1, ngmax = 3)
Arguments
x: input string

char: logical value specifying whether to use character (char = TRUE) or word n-grams (char = FALSE, the default)

ngmin: integer giving the minimum order of n-gram (default: 1)

ngmax: integer giving the maximum order of n-gram (default: 3)
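How the ngmin/ngmax arguments interact with the char switch can be sketched as follows. This is a language-agnostic illustration in Python, not the package's R implementation; the function name ngram_tokenize_sketch is hypothetical, and the word splitter is a simple whitespace split rather than whatever tokenization the package actually applies.

```python
def ngram_tokenize_sketch(x, char=False, ngmin=1, ngmax=3):
    # Hypothetical sketch of n-gram tokenization, not the package's code.
    # Units are single characters (char=True) or whitespace-split words.
    units = list(x) if char else x.split()
    sep = "" if char else " "
    grams = []
    # Emit every n-gram of order ngmin through ngmax, in order.
    for n in range(ngmin, ngmax + 1):
        for i in range(len(units) - n + 1):
            grams.append(sep.join(units[i:i + n]))
    return grams

# Character 1- and 2-grams of "ab": ["a", "b", "ab"]
print(ngram_tokenize_sketch("ab", char=True, ngmin=1, ngmax=2))

# Word bigrams of a sentence: ["Romeo loves", "loves Juliet"]
print(ngram_tokenize_sketch("Romeo loves Juliet", ngmin=2, ngmax=2))
```

Note that with ngmin < ngmax the lower-order grams are included as well, which is why the default (1, 3) produces unigrams, bigrams, and trigrams together.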
Examples
library(tm)
en <- c("Romeo loves Juliet", "Romeo loves a girl")
en.corpus <- VCorpus(VectorSource(en))
tdm <- TermDocumentMatrix(en.corpus,
                          control=list(wordLengths=c(1, Inf),
                                       tokenize=function(x) ngram_tokenize(x, char=TRUE,
                                                                           ngmin=3, ngmax=3)))
inspect(tdm)
ch <- c("abab", "aabb")
ch.corpus <- VCorpus(VectorSource(ch))
tdm <- TermDocumentMatrix(ch.corpus,
                          control=list(wordLengths=c(1, Inf),
                                       tokenize=function(x) ngram_tokenize(x, char=TRUE,
                                                                           ngmin=1, ngmax=2)))
inspect(tdm)
[Package SentimentAnalysis version 1.3-5 Index]