ngramTokens {doc2concrete} | R Documentation |
Ngram Tokenizer
Description
Tally bag-of-words ngram features
Usage
ngramTokens(
texts,
wstem = "all",
ngrams = 1,
language = "english",
punct = TRUE,
stop.words = TRUE,
number.words = TRUE,
per.100 = FALSE,
overlap = 1,
sparse = 0.995,
verbose = FALSE,
vocabmatch = NULL,
num.mc.cores = 1
)
Arguments
texts |
character vector of texts. |
wstem |
character Which words should be stemmed? Defaults to "all". |
ngrams |
numeric Vector of ngram lengths to be included. Default is 1 (i.e. unigrams only). |
language |
Language for stemming. Default is "english" |
punct |
logical Should punctuation be kept as tokens? Default is TRUE |
stop.words |
logical Should stop words be kept? Default is TRUE |
number.words |
logical Should numbers be kept as words? Default is TRUE |
per.100 |
logical Should counts be expressed as frequency per 100 words? Default is FALSE |
overlap |
numeric Threshold (as cosine distance) for including ngrams that constitute other included phrases. Default is 1 (i.e. all ngrams included). |
sparse |
numeric Maximum feature sparsity for inclusion (1 = include all features). Default is 0.995.
verbose |
logical Should the package report token counts after each ngram level? Useful for long-running code. Default is FALSE. |
vocabmatch |
matrix If provided, the new token count matrix will be coerced to include the same tokens as this previous count matrix. Default is NULL (i.e. no token matching).
num.mc.cores |
numeric Number of cores for parallel processing; see parallel::detectCores(). Default is 1.
Details
This function produces ngram featurizations of text based on the quanteda package. It complements the doc2concrete function by demonstrating how to build a feature set for training a new detection algorithm in other contexts.
Value
a matrix of feature counts
Examples
dim(ngramTokens(feedback_dat$feedback, ngrams=1))
dim(ngramTokens(feedback_dat$feedback, ngrams=1:3))
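A minimal sketch extending the examples above, showing how vocabmatch aligns a new corpus to a previously built vocabulary (useful when applying a model trained on one feature matrix to new texts). This assumes the doc2concrete package and its bundled feedback_dat data are available; the train/test split indices are illustrative, not part of the package.

```r
# Sketch: align test features to a training vocabulary (assumes
# doc2concrete is installed; feedback_dat ships with the package)
library(doc2concrete)

# Illustrative split of the bundled feedback data
train_ids <- 1:100
test_ids  <- 101:nrow(feedback_dat)

# Unigram + bigram counts for the training texts
train_X <- ngramTokens(feedback_dat$feedback[train_ids], ngrams = 1:2)

# Coerce the test matrix to the training vocabulary so columns align
test_X <- ngramTokens(feedback_dat$feedback[test_ids], ngrams = 1:2,
                      vocabmatch = train_X)

# Matching column names mean the matrices can feed the same model
stopifnot(identical(colnames(train_X), colnames(test_X)))
```

Because the test matrix shares the training matrix's columns, a classifier fit on train_X can score test_X directly without any manual feature reconciliation.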