cleansing_corpus {labourR}    R Documentation

Cleansing Corpus

Description

Cleanses free text by removing escape characters, non-alphanumeric characters, long words, and excess whitespace, and by converting all letters to lower case. Each step can be switched off through its corresponding argument.

Usage

cleansing_corpus(
  text,
  escape_chars = TRUE,
  nonalphanum = TRUE,
  longwords = TRUE,
  whitespace = TRUE,
  tolower = TRUE
)

Arguments

text

Character vector of free text to be cleansed.

escape_chars

If TRUE, removes the escape characters \n (newline), \r (carriage return) and \t (tab).

nonalphanum

If TRUE, removes non-alphanumeric characters.

longwords

If TRUE, removes words with more than 35 characters.

whitespace

If TRUE, removes excess whitespace.

tolower

If TRUE, converts all letters to lower case.

Value

A character vector of the cleansed text.

Examples

txt <- "It has roots in a piece of classical Latin literature from 45 BC"
cleansing_corpus(txt)
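
The steps described under Arguments can be approximated in base R. The following is only a rough sketch for illustration, not the labourR implementation: the helper name cleanse_sketch, its max_chars parameter, the order of the steps, and the choice to replace removed characters with a space (rather than deleting them outright) are all assumptions.

```r
# Hypothetical base-R approximation of the cleansing steps.
# NOT the labourR source; step order and replacement rules are assumptions.
cleanse_sketch <- function(text, max_chars = 35) {
  # 1. Replace newline, carriage return and tab with a space.
  text <- gsub("[\n\r\t]", " ", text)
  # 2. Replace anything that is not alphanumeric or a space with a space.
  text <- gsub("[^[:alnum:] ]", " ", text)
  # 3. Drop runs of more than max_chars alphanumeric characters ("long words").
  #    After step 2 every word is a maximal alphanumeric run, so no word
  #    boundary anchors are needed.
  long_word <- sprintf("[[:alnum:]]{%d,}", max_chars + 1)
  text <- gsub(long_word, " ", text)
  # 4. Collapse repeated spaces and trim both ends.
  text <- trimws(gsub(" +", " ", text))
  # 5. Convert to lower case.
  tolower(text)
}

cleanse_sketch("Hello,\tWorld!\n")
# [1] "hello world"
```

The output shown is that of the sketch alone; cleansing_corpus() may treat edge cases such as punctuation inside words or non-ASCII letters differently.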

[Package labourR version 1.0.0 Index]