control |
A list of control options which override default
settings.
First, the following two options are processed.
tokenize: A function tokenizing a TextDocument
into single tokens, a Span_Tokenizer, a
Token_Tokenizer, or a string matching one of the
predefined tokenization functions:
"Boost" for Boost_tokenizer,
"MC" for MC_tokenizer,
"scan" for scan_tokenizer, or
"words" for words.
Defaults to words.
tolower: Either a logical value indicating whether
characters should be translated to lower case or a custom function
converting characters to lower case. Defaults to
tolower.
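As a rough illustration of these two options, the following sketch assumes the tm package is installed and that doc is a PlainTextDocument. The sample sentence and the helper split_on_space are hypothetical and only serve the example.

```r
library(tm)

# Hypothetical sample document (no punctuation, single spaces).
doc <- PlainTextDocument("The quick brown fox and the lazy dog")

# Defaults: tokenize with words and fold tokens to lower case.
tf_default <- termFreq(doc)

# A predefined tokenizer selected by name, with case folding switched off.
tf_scan <- termFreq(doc, control = list(tokenize = "scan", tolower = FALSE))

# A custom tokenizer (hypothetical helper): split the text on single spaces.
split_on_space <- function(x) unlist(strsplit(as.character(x), " ", fixed = TRUE))
tf_custom <- termFreq(doc, control = list(tokenize = split_on_space))
```

With case folding enabled, "The" and "the" are counted as one term; with tolower = FALSE they remain distinct entries.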
Next, a set of options that are sensitive to their order of
occurrence in the control list is processed. Options are applied in
the order in which they are specified. User-specified options take
precedence over the default ordering: all user-specified options are
processed first, followed by all remaining options (with their default
settings, in the order listed below).
language: A character string giving the language (preferably as
an IETF language tag, see language in package
NLP) to be used for stopwords and stemming if
not provided by doc.
removePunctuation: A logical value indicating whether
punctuation characters should be removed from
doc, a custom function which performs punctuation
removal, or a list of arguments for
removePunctuation. Defaults to FALSE.
removeNumbers: A logical value indicating whether
numbers should be removed from doc or a custom function
for number removal. Defaults to FALSE.
stopwords: Either a logical value indicating stopword
removal using the default language-specific stopword lists shipped
with this package, a character vector holding custom
stopwords, or a custom function for stopword removal. Defaults
to FALSE.
stemming: Either a logical value indicating whether tokens
should be stemmed or a custom stemming function. Defaults to
FALSE.
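The effect of the ordering can be seen in a small sketch, assuming the tm package is installed. The sample sentence and the custom stopword "dont" are hypothetical and chosen so that the two orderings give different results.

```r
library(tm)

# Hypothetical sample document containing a contraction.
doc <- PlainTextDocument("Please don't stop the music")

# Punctuation is removed first, so "don't" becomes "dont"
# and is then dropped by the custom stopword list.
tf_punct_first <- termFreq(
  doc,
  control = list(removePunctuation = TRUE, stopwords = c("dont"))
)

# Stopwords are applied first, while the token is still "don't",
# so nothing matches; punctuation removal then leaves "dont" in place.
tf_stop_first <- termFreq(
  doc,
  control = list(stopwords = c("dont"), removePunctuation = TRUE)
)

"dont" %in% names(tf_punct_first)
"dont" %in% names(tf_stop_first)
```

Both calls set the same two options; only their position in the control list differs.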
Finally, the following options are processed in the given order.
dictionary: A character vector to be tabulated
against. No other terms will be listed in the result. Defaults
to NULL, which means that all terms in doc are
listed.
bounds: A list with a tag local whose value
must be an integer vector of length 2. Terms that appear less
often in doc than the lower bound bounds$local[1]
or more often than the upper bound bounds$local[2] are
discarded. Defaults to list(local = c(1, Inf)) (i.e., every
token will be used).
wordLengths: An integer vector of length 2. Words
shorter than the minimum word length wordLengths[1] or
longer than the maximum word length wordLengths[2] are
discarded. Defaults to c(3, Inf), i.e., a minimum word
length of 3 characters.
|
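The final three options, dictionary, bounds, and wordLengths, can be illustrated with a minimal sketch, assuming the tm package is installed. The sample document is hypothetical and built so that term lengths and counts are easy to read off.

```r
library(tm)

# Hypothetical sample: term lengths 1 to 5, with counts 1, 1, 2, 3, 1.
doc <- PlainTextDocument("a bb ccc ccc dddd dddd dddd eeeee")

# Defaults: terms shorter than 3 characters ("a", "bb") are discarded.
tf_default <- termFreq(doc)

# Lower the minimum word length so that every term is kept.
tf_all <- termFreq(doc, control = list(wordLengths = c(1, Inf)))

# Keep only terms that occur at least twice in the document.
tf_bounds <- termFreq(doc, control = list(bounds = list(local = c(2, Inf))))

# Tabulate only against a fixed set of terms.
tf_dict <- termFreq(doc, control = list(dictionary = c("ccc", "eeeee")))
```

Inspecting names() of each result shows which terms survive the respective filter.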