ngramTokens {doc2concrete} | R Documentation |
Ngram Tokenizer
Description
Tally bag-of-words ngram features
Usage
ngramTokens(
texts,
wstem = "all",
ngrams = 1,
language = "english",
punct = TRUE,
stop.words = TRUE,
number.words = TRUE,
per.100 = FALSE,
overlap = 1,
sparse = 0.995,
verbose = FALSE,
vocabmatch = NULL,
num.mc.cores = 1
)
Arguments
texts |
character vector of texts. |
wstem |
character Which words should be stemmed? Defaults to "all". |
ngrams |
numeric Vector of ngram lengths to be included. Default is 1 (i.e. unigrams only). |
language |
Language for stemming. Default is "english" |
punct |
logical Should punctuation be kept as tokens? Default is TRUE |
stop.words |
logical Should stop words be kept? Default is TRUE |
number.words |
logical Should numbers be kept as words? Default is TRUE |
per.100 |
logical Should counts be expressed as frequency per 100 words? Default is FALSE |
overlap |
numeric Threshold (as cosine distance) for including ngrams that constitute other included phrases. Default is 1 (i.e. all ngrams included). |
sparse |
numeric Maximum feature sparsity for inclusion (1 = include all features). Default is 0.995.
verbose |
logical Should the package report token counts after each ngram level? Useful for long-running code. Default is FALSE. |
vocabmatch |
matrix If provided, the new token count matrix will be coerced to include the same tokens as this previous count matrix. Default is NULL (i.e. no token matching).
num.mc.cores |
numeric Number of cores for parallel processing; see parallel::detectCores(). Default is 1.
Details
This function produces ngram featurizations of text based on the quanteda package. It complements the doc2concrete function by demonstrating how to build a feature set for training a new detection algorithm in other contexts.
Value
a matrix of feature counts
Examples
dim(ngramTokens(feedback_dat$feedback, ngrams=1))
dim(ngramTokens(feedback_dat$feedback, ngrams=1:3))
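A minimal sketch extending the examples above, showing how vocabmatch aligns a new corpus to a previously built vocabulary (useful when applying a model trained on one feature matrix to new texts). This assumes the doc2concrete package and its bundled feedback_dat data are available; the train/test split indices are illustrative, not part of the package.

```r
# Sketch: align test features to a training vocabulary (assumes
# doc2concrete is installed; feedback_dat ships with the package)
library(doc2concrete)

# Illustrative split of the bundled feedback data
train_ids <- 1:100
test_ids  <- 101:nrow(feedback_dat)

# Unigram + bigram counts for the training texts
train_X <- ngramTokens(feedback_dat$feedback[train_ids], ngrams = 1:2)

# Coerce the test matrix to the training vocabulary so columns align
test_X <- ngramTokens(feedback_dat$feedback[test_ids], ngrams = 1:2,
                      vocabmatch = train_X)

# Matching column names mean the matrices can feed the same model
stopifnot(identical(colnames(train_X), colnames(test_X)))
```

Because the test matrix shares the training matrix's columns, a classifier fit on train_X can score test_X directly without any manual feature reconciliation.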