tCorpus$preprocess {corpustools} | R Documentation |
Preprocess feature
Description
Usage:
Arguments
column | The column containing the feature to be used as input. |
new_column | The column in which to save the preprocessed feature. Can be a new column or overwrite an existing one. |
lowercase | If TRUE, make the feature lowercase. |
ngrams | An integer giving the ngram size. Ngrams are aligned with the rows in the token data, with the feature in each row being the last token of its ngram. For example, given the features "this is an example", the third feature ("an") will have the trigram "this_is_an". Ngrams at the beginning of a context are padded with empty positions. Thus, in the same example, the second feature ("is") will have the trigram "_this_is". |
ngram_context | The context within which ngrams are created, either 'document' or 'sentence'. Ngrams are not created across contexts. For example, if ngram_context is 'sentence', the last token of sentence 1 will not form an ngram with the first token of sentence 2. |
as_ascii | Convert characters to ASCII. This is particularly useful for dealing with special characters. |
remove_punctuation | Remove (i.e. make NA) any features that consist only of punctuation (e.g., dots, commas). |
remove_stopwords | Remove (i.e. make NA) stopwords. Important: make sure to set the language argument correctly. |
remove_numbers | Remove features that consist only of numbers. |
use_stemming | Reduce features (tokens) to their stem. |
language | The language used for stopwords and stemming. |
min_freq | An integer specifying the minimum token frequency. |
min_docfreq | An integer specifying the minimum document frequency. |
max_freq | An integer specifying the maximum token frequency. |
max_docfreq | An integer specifying the maximum document frequency. |
min_char | An integer specifying the minimum number of characters in a term. |
max_char | An integer specifying the maximum number of characters in a term. |
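The alignment described for the ngrams and ngram_context arguments can be illustrated with a small base-R sketch. It is not the corpustools implementation: trailing_ngrams() is a hypothetical helper, and both the "_" separator and the empty-string padding are assumptions taken from the "this_is_an" example. Each row gets the ngram that ends at that row, and slots that would reach before the start of the current context are left empty.

```r
# Illustrative sketch only: trailing_ngrams() is a hypothetical helper,
# not a corpustools function.
trailing_ngrams <- function(tokens, context, n = 3, sep = "_") {
  stopifnot(length(tokens) == length(context), n >= 1)
  # Position of each token within its own context (1, 2, 3, ...).
  # Assumes tokens are in reading order and each context is contiguous.
  pos <- ave(seq_along(tokens), context, FUN = seq_along)
  back <- (n - 1):0  # how far each ngram slot looks back from the row
  vapply(seq_along(tokens), function(i) {
    parts <- rep("", n)
    # Only fill the slots that stay inside the current context
    inside <- back < pos[i]
    parts[inside] <- tokens[i - back[inside]]
    paste(parts, collapse = sep)
  }, character(1))
}

# One context: every row holds the trigram ending at that row
trailing_ngrams(c("this", "is", "an", "example"), context = rep(1, 4), n = 3)
# "__this" "_this_is" "this_is_an" "is_an_example"

# Two sentences as contexts: no bigram spans the sentence boundary
trailing_ngrams(c("a", "b", "c", "d"), context = c(1, 1, 2, 2), n = 2)
# "_a" "a_b" "_c" "c_d"
```

Because the output has one element per token, it can be stored as a column next to the original tokens, which is the layout the ngrams argument describes.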
Details
## R6 method for class tCorpus. Use as tc$method (where tc is a tCorpus object).
preprocess(column='token', new_column='feature', lowercase=T, ngrams=1,
           ngram_context=c('document', 'sentence'), as_ascii=F,
           remove_punctuation=T, remove_stopwords=F, remove_numbers=F,
           use_stemming=F, language='english', min_freq=NULL,
           min_docfreq=NULL, max_freq=NULL, max_docfreq=NULL,
           min_char=NULL, max_char=NULL)
Examples
tc = create_tcorpus('I am a SHORT example sentence! That I am!')
## default is lowercase without punctuation
tc$preprocess('token', 'preprocessed_1')
## delete stopwords and perform stemming
tc$preprocess('token', 'preprocessed_2', remove_stopwords = TRUE, use_stemming = TRUE)
## filter on minimum frequency
tc$preprocess('token', 'preprocessed_3', min_freq=2)
## make ngrams
tc$preprocess('token', 'preprocessed_4', ngrams = 3)
tc$tokens
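To make the difference between token frequency (min_freq, max_freq) and document frequency (min_docfreq, max_docfreq) concrete, here is a minimal base-R sketch. It does not use corpustools: filter_by_freq() is a hypothetical helper, and setting filtered features to NA is an assumption that mirrors how remove_punctuation is described above.

```r
# Illustrative sketch only: filter_by_freq() is a hypothetical helper,
# not a corpustools function.
filter_by_freq <- function(feature, doc_id,
                           min_freq = NULL, max_freq = NULL,
                           min_docfreq = NULL, max_docfreq = NULL) {
  feature <- as.character(feature)
  # Token frequency: how often each feature occurs in the whole corpus
  freq <- table(feature)
  # Document frequency: in how many distinct documents each feature occurs
  first_in_doc <- !duplicated(data.frame(doc_id, feature))
  docfreq <- table(feature[first_in_doc])
  f <- as.vector(freq[feature])
  d <- as.vector(docfreq[feature])

  keep <- !is.na(feature)
  if (!is.null(min_freq))    keep <- keep & f >= min_freq
  if (!is.null(max_freq))    keep <- keep & f <= max_freq
  if (!is.null(min_docfreq)) keep <- keep & d >= min_docfreq
  if (!is.null(max_docfreq)) keep <- keep & d <= max_docfreq

  # Assumption: filtered features become NA rather than being dropped
  feature[!keep] <- NA
  feature
}

feature <- c("x", "x", "y", "z", "y")
doc_id  <- c(1, 1, 1, 2, 2)

# "x" and "y" both occur twice, "z" once
filter_by_freq(feature, doc_id, min_freq = 2)
# "x" "x" "y" NA "y"

# only "y" occurs in two different documents
filter_by_freq(feature, doc_id, min_docfreq = 2)
# NA NA "y" NA "y"
```

Here "x" passes the token-frequency filter but not the document-frequency filter, because both of its occurrences fall in the same document.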