preprocess_tokens {corpustools}    R Documentation

Preprocess tokens in a character vector

Description

Preprocess tokens in a character vector

Usage

preprocess_tokens(
  x,
  context = NULL,
  language = "english",
  use_stemming = F,
  lowercase = T,
  ngrams = 1,
  replace_whitespace = F,
  as_ascii = F,
  remove_punctuation = T,
  remove_stopwords = F,
  remove_numbers = F,
  min_freq = NULL,
  min_docfreq = NULL,
  max_freq = NULL,
  max_docfreq = NULL,
  min_char = NULL,
  max_char = NULL,
  ngram_skip_empty = T
)

Arguments

x

A character or factor vector in which each element is a token (i.e. a tokenized text)

context

Optionally, a character vector of the same length as x, specifying the context of each token (e.g., document, sentence). Must be given if ngrams > 1

language

The language used for stemming and removing stopwords

use_stemming

Logical, use stemming. (Make sure to specify the right language!)

lowercase

Logical, make tokens lowercase

ngrams

A number, specifying the number of tokens per ngram. Default is unigrams (1).

replace_whitespace

Logical. If TRUE, all whitespace is replaced by underscores

as_ascii

Logical. If TRUE, tokens will be forced to ASCII

remove_punctuation

Logical. If TRUE, punctuation is removed

remove_stopwords

Logical. If TRUE, stopwords are removed (Make sure to specify the right language!)

remove_numbers

Logical. If TRUE, tokens that consist only of numbers are removed

min_freq

An integer, specifying the minimum token frequency.

min_docfreq

An integer, specifying the minimum document frequency (see the filtering example under Examples).

max_freq

An integer, specifying the maximum token frequency.

max_docfreq

An integer, specifying the maximum document frequency.

min_char

An integer, specifying the minimum number of characters in a term.

max_char

An integer, specifying the maximum number of characters in a term.

ngram_skip_empty

If ngrams are used, determines whether empty (filtered-out) terms are skipped (i.e. c("this", NA, "test") becomes "this_test") or kept as empty positions in the ngrams.
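
For example, the following sketch (illustrative, not part of the package's documented examples) contrasts the two settings; the exact output depends on the stopword list for the chosen language:

tokens  = c('this', 'is', 'a', 'short', 'test', 'sentence')
context = rep('doc1', length(tokens))
## with the default ngram_skip_empty = TRUE, bigrams bridge the removed stopwords
preprocess_tokens(tokens, context = context, remove_stopwords = TRUE, ngrams = 2)
## with ngram_skip_empty = FALSE, the emptied positions are presumably kept instead
preprocess_tokens(tokens, context = context, remove_stopwords = TRUE, ngrams = 2,
                  ngram_skip_empty = FALSE)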

Value

A factor vector of preprocessed tokens.
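
For instance, based on the ngram_skip_empty description above, filtered tokens appear to be returned as NA, keeping the result aligned with the input (an assumption, not a documented guarantee):

preprocess_tokens(c('Hello', ',', 'world'))  # the ',' is removed by default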

Examples

tokens = c('I', 'am', 'a', 'SHORT', 'example', 'sentence', '!')

## default is lowercase without punctuation
preprocess_tokens(tokens)

## optionally, delete stopwords, perform stemming, and make ngrams
preprocess_tokens(tokens, remove_stopwords = TRUE, use_stemming = TRUE)
preprocess_tokens(tokens, context = NA, ngrams = 3)
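
## illustrative sketch (not part of the package's documented examples):
## min_docfreq and min_char drop rare or very short tokens; 'doc1' and 'doc2'
## are hypothetical document ids passed as context
tokens  = c('an', 'example', 'text', 'another', 'example')
context = c('doc1', 'doc1', 'doc1', 'doc2', 'doc2')
preprocess_tokens(tokens, context = context, min_docfreq = 2)  # keep tokens occurring in >= 2 documents
preprocess_tokens(tokens, context = context, min_char = 5)     # keep tokens with >= 5 characters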

[Package corpustools version 0.5.1]