control |
A list of control options which override default
settings.
First, the following two options are processed.
tokenize A function tokenizing a TextDocument
into single tokens, a Span_Tokenizer, a
Token_Tokenizer, or a string matching one of the
predefined tokenization functions:
"Boost" for Boost_tokenizer, or
"MC" for MC_tokenizer, or
"scan" for scan_tokenizer, or
"words" for words.
Defaults to words.
tolower Either a logical value indicating whether
characters should be translated to lower case or a custom function
converting characters to lower case. Defaults to
tolower.
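As a minimal sketch of these two options, the following R snippet tabulates a short made-up sentence twice: once with lower-casing left at its default and once with tolower = FALSE. It assumes the tm package is installed. The sample text and the variable names (doc, tf_lower, tf_case) are illustrative only.

```r
library(tm)

# Illustrative one-sentence document (not taken from the package data sets).
doc <- PlainTextDocument("The cat saw the other cat")

# Select the predefined scan_tokenizer by name; lower-casing stays at its default.
tf_lower <- termFreq(doc, control = list(tokenize = "scan"))

# Same tokenizer, but keep the original case of each token.
tf_case <- termFreq(doc, control = list(tokenize = "scan", tolower = FALSE))

print(tf_lower)
print(tf_case)
```

With lower-casing, "The" and "the" should be counted as a single term. With tolower = FALSE they are tabulated separately.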
Next, a set of options is processed that is sensitive to the order of
occurrence in the control list. Options are processed in the
order in which they are specified. User-specified options take
precedence over the default ordering: all user-specified options are
processed first, followed by all remaining options (with their default
settings, in the order listed below).
language A character string giving the language (preferably as
an IETF language tag, see language in package
NLP) to be used for stopwords and stemming if
not provided by doc.
removePunctuation A logical value indicating whether
punctuation characters should be removed from
doc, a custom function which performs punctuation
removal, or a list of arguments for
removePunctuation. Defaults to FALSE.
removeNumbers A logical value indicating whether
numbers should be removed from doc or a custom function
for number removal. Defaults to FALSE.
stopwords Either a logical value indicating stopword
removal using the default language-specific stopword lists shipped
with this package, a character vector holding custom
stopwords, or a custom function for stopword removal. Defaults
to FALSE.
stemming Either a logical value indicating whether tokens
should be stemmed or a custom stemming function. Defaults to
FALSE.
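The effect of option order can be sketched with two control lists that differ only in the position of removePunctuation and stopwords. This is an illustration under assumptions, not part of the package examples: it assumes the tm package is installed, and the sample text and the names doc, punct_first and stop_first are made up.

```r
library(tm)

# Illustrative document; the scan tokenizer splits on whitespace only,
# so the comma stays attached to "cat," after tokenization.
doc <- PlainTextDocument("cat, dog cat, bird")

# Punctuation removal listed first: "cat," becomes "cat" before the
# custom stopword list is applied.
punct_first <- termFreq(doc, control = list(tokenize = "scan",
                                            removePunctuation = TRUE,
                                            stopwords = c("cat")))

# Stopword removal listed first: the token is still "cat," at that point,
# so it does not match the stopword "cat".
stop_first <- termFreq(doc, control = list(tokenize = "scan",
                                           stopwords = c("cat"),
                                           removePunctuation = TRUE))

print(punct_first)
print(stop_first)
```

Because user-specified options are processed in the order given, the term "cat" should be dropped in the first call and retained in the second.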
Finally, the following options are processed in the given order.
dictionary A character vector to be tabulated
against. No other terms will be listed in the result. Defaults
to NULL, which means that all terms in doc are
listed.
bounds A list with a tag local whose value
must be an integer vector of length 2. Terms that appear less
often in doc than the lower bound bounds$local[1]
or more often than the upper bound bounds$local[2] are
discarded. Defaults to list(local = c(1, Inf)) (i.e., every
token will be used).
wordLengths An integer vector of length 2. Words
shorter than the minimum word length wordLengths[1] or
longer than the maximum word length wordLengths[2] are
discarded. Defaults to c(3, Inf), i.e., a minimum word
length of 3 characters.
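The three filtering options above can be tried on a small made-up document. The sketch below assumes the tm package is installed. The sample text and the result names (tf_default, tf_all, tf_bounds, tf_dict) are illustrative only.

```r
library(tm)

# Illustrative document with terms of different lengths and frequencies:
# "a" x1, "bb" x1, "ccc" x2, "dddd" x3.
doc <- PlainTextDocument("a bb ccc ccc dddd dddd dddd")

# Default wordLengths = c(3, Inf): terms shorter than 3 characters are dropped.
tf_default <- termFreq(doc, control = list(tokenize = "scan"))

# Lower the minimum word length to 1 to keep the short terms as well.
tf_all <- termFreq(doc, control = list(tokenize = "scan",
                                       wordLengths = c(1, Inf)))

# Keep only terms that occur exactly twice in the document.
tf_bounds <- termFreq(doc, control = list(tokenize = "scan",
                                          bounds = list(local = c(2, 2))))

# Tabulate only against the given dictionary.
tf_dict <- termFreq(doc, control = list(tokenize = "scan",
                                        dictionary = c("dddd")))

print(tf_default)
print(tf_all)
print(tf_bounds)
print(tf_dict)
```

Note that wordLengths is applied even when it is not specified, so short terms disappear unless the minimum length is lowered explicitly.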
|