tCorpus$preprocess {corpustools}    R Documentation

Preprocess feature

Description

Usage

Arguments

column

the column containing the feature to be used as the input

new_column

the column to save the preprocessed feature. Can be a new column or overwrite an existing one.

lowercase

make feature lowercase

ngrams

create ngrams. The ngrams match the rows in the token data, with the feature in the row being the last token of the ngram. For example, given the features "this is an example", the third feature ("an") will have the trigram "this_is_an". Ngrams at the beginning of a context are padded with empty strings. Thus, in the previous example, the second feature ("is") will have the trigram "_this_is".

ngram_context

the context within which ngrams are created, which can be 'document' or 'sentence'. Ngrams will not be created across contexts. For example, if ngram_context is 'sentence', then the last token of sentence 1 will not form an ngram with the first token of sentence 2.

as_ascii

convert characters to ASCII. This is particularly useful for dealing with special characters.

remove_punctuation

remove (i.e. make NA) any features that are only punctuation (e.g., dots, commas)

remove_stopwords

remove (i.e. make NA) stopwords. (!) Make sure to set the language argument correctly.

remove_numbers

remove (i.e. make NA) features that are only numbers

use_stemming

reduce features (tokens) to their stem

language

The language used for stopwords and stemming

min_freq

an integer, specifying minimum token frequency.

min_docfreq

an integer, specifying minimum document frequency.

max_freq

an integer, specifying maximum token frequency.

max_docfreq

an integer, specifying maximum document frequency.

min_char

an integer, specifying minimum number of characters in a term

max_char

an integer, specifying maximum number of characters in a term
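
The row alignment described for the ngrams and ngram_context arguments can be illustrated with a small base R sketch. The helper trailing_ngrams below is hypothetical and is not part of corpustools. It only mirrors the documented behaviour (the row's own feature is the last token of the ngram, and positions before the start of a context are left empty), and the exact padding that corpustools produces may differ.

```r
# Hypothetical helper, NOT the corpustools implementation.
# Builds, for every row, the ngram that ends at that row's token.
trailing_ngrams <- function(tokens, n, context = rep(1L, length(tokens))) {
  out <- character(length(tokens))
  for (i in seq_along(tokens)) {
    idx <- (i - n + 1):i                    # positions covered by the ngram
    valid <- idx >= 1                       # positions that exist at all
    # of the existing positions, keep only those in the same context as row i
    valid[valid] <- context[idx[valid]] == context[i]
    # invalid positions become empty strings (pmax avoids out-of-range indices)
    parts <- ifelse(valid, tokens[pmax(idx, 1)], "")
    out[i] <- paste(parts, collapse = "_")
  }
  out
}

tokens <- c("this", "is", "an", "example")

# Trigrams within a single context
trailing_ngrams(tokens, 3)
# "__this" "_this_is" "this_is_an" "is_an_example"

# Bigrams with two contexts (e.g. two sentences): no ngram crosses the boundary
trailing_ngrams(tokens, 2, context = c(1, 1, 2, 2))
# "_this" "this_is" "_an" "an_example"
```

Keeping one ngram per row means the result has the same length as the token data, so it can be stored as an ordinary column next to the original feature.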

Details

## R6 method for class tCorpus. Use as tc$method (where tc is a tCorpus object).

preprocess(column='token', new_column='feature', lowercase=T, ngrams=1,
           ngram_context=c('document', 'sentence'), as_ascii=F, remove_punctuation=T,
           remove_stopwords=F, remove_numbers=F, use_stemming=F, language='english',
           min_freq=NULL, min_docfreq=NULL, max_freq=NULL, max_docfreq=NULL, min_char=NULL, max_char=NULL)
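
The difference between token frequency (min_freq, max_freq) and document frequency (min_docfreq, max_docfreq) can be sketched in base R. This is an illustration under assumptions, not the corpustools implementation: the toy data frame and its column names are made up, and the sketch assumes that filtered features are set to NA, as the remove_* arguments do.

```r
# Toy token table with one row per token (column names are assumptions)
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token  = c("a", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# Token frequency: how often each term occurs in the whole corpus
freq <- table(tokens$token)
# Document frequency: in how many distinct documents each term occurs
docfreq <- table(unique(tokens)$token)

# Look up both frequencies for every row
token_freq <- as.vector(freq[tokens$token])     # 2 2 2 2 1
doc_freq   <- as.vector(docfreq[tokens$token])  # 1 1 2 2 1

# Analogue of min_freq = 2: "c" occurs only once, so it becomes NA
feature_min_freq <- ifelse(token_freq >= 2, tokens$token, NA_character_)
# "a" "a" "b" "b" NA

# Analogue of min_docfreq = 2: only "b" occurs in both documents
feature_min_docfreq <- ifelse(doc_freq >= 2, tokens$token, NA_character_)
# NA NA "b" "b" NA
```

The term "a" passes the token frequency filter but not the document frequency filter, because both of its occurrences are in the same document.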
           

Examples

library(corpustools)

tc = create_tcorpus('I am a SHORT example sentence! That I am!')

## default is lowercase without punctuation
tc$preprocess('token', 'preprocessed_1')

## delete stopwords and perform stemming
tc$preprocess('token', 'preprocessed_2', remove_stopwords = TRUE, use_stemming = TRUE)

## filter on minimum frequency
tc$preprocess('token', 'preprocessed_3', min_freq=2)

## make ngrams
tc$preprocess('token', 'preprocessed_4', ngrams = 3)

tc$tokens

[Package corpustools version 0.5.1 Index]