R: Data Preprocessing

cleanTexts {tosca}

R Documentation

Data Preprocessing

Description

Removes punctuation, numbers and stopwords, converts letters to lowercase and tokenizes the texts.

Usage

cleanTexts(
  object,
  text,
  sw = "en",
  paragraph = FALSE,
  lowercase = TRUE,
  rmPunctuation = TRUE,
  rmNumbers = TRUE,
  checkUTF8 = TRUE,
  ucp = TRUE
)

Arguments

`object`	`textmeta` object
`text`	Not necessary if `object` is specified, else should be `object$text`: List of article texts.
`sw`	Character: Vector of stopwords. If the vector is of length one, `sw` is interpreted as argument for `stopwords` from the tm package.
`paragraph`	Logical: Should be set to `TRUE` if one article is a list of character strings, representing the paragraphs.
`lowercase`	Logical: Should be set to `TRUE` if all letters should be coerced to lowercase.
`rmPunctuation`	Logical: Should be set to `TRUE` if punctuation should be removed from articles.
`rmNumbers`	Logical: Should be set to `TRUE` if numbers should be removed from articles.
`checkUTF8`	Logical: Should be set to `TRUE` if articles should be checked for UTF-8 encoding, which is the package standard.
`ucp`	Logical: ucp option for `removePunctuation` from the tm package. Runs punctuation removal twice (ASCII and Unicode).
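
As a rough illustration of how the arguments above combine, the following sketch calls `cleanTexts` once with a language code and once with a custom stopword vector. It assumes that tosca (and its dependency tm) is installed; the example texts and the variable names `res_default` and `res_custom` are made up for this sketch, and the exact tokens returned are not shown because they depend on the installed stopword list.

```r
# Minimal sketch, assuming the tosca package is installed.
# The texts and result names below are illustrative only.
if (requireNamespace("tosca", quietly = TRUE)) {
  texts <- list(
    A = "Teach a Man To Fish in 2 Days!",
    B = "So Long, and Thanks for All the Fish"
  )

  # sw of length one: interpreted as argument for tm's stopwords()
  res_default <- tosca::cleanTexts(text = texts, sw = "en")

  # sw of length > 1: used as the stopword vector itself;
  # here letter case and numbers are left untouched
  res_custom <- tosca::cleanTexts(
    text = texts,
    sw = c("a", "and", "the"),
    lowercase = FALSE,
    rmNumbers = FALSE
  )

  str(res_custom)
}
```

Note that with `lowercase = FALSE` a lowercase stopword such as `"a"` may not match an uppercase `"A"` in the text, so it can be worth inspecting the result before relying on it.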

Details

Removes punctuation, numbers and stopwords, converts letters to lowercase and tokenizes the texts. Some additional cleaning steps are applied: empty words, paragraphs and articles are removed.
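
The steps described above can be approximated in base R as follows. This is only a sketch for intuition, not the code used by tosca: the function name `clean_sketch`, the tiny stopword vector and the order of the steps are assumptions made for this example.

```r
# Illustrative base-R approximation of the cleaning steps.
# Hypothetical helper; NOT the implementation used by tosca.
clean_sketch <- function(texts, stopwords = c("a", "and", "for", "the")) {
  lapply(texts, function(x) {
    x <- tolower(x)                               # lowercase
    x <- gsub("[[:punct:]]+", "", x)              # remove punctuation
    x <- gsub("[[:digit:]]+", "", x)              # remove numbers
    tokens <- strsplit(x, "[[:space:]]+")[[1]]    # tokenize on whitespace
    tokens <- tokens[!(tokens %in% stopwords)]    # remove stopwords
    tokens[nzchar(tokens)]                        # remove empty words
  })
}

texts <- list(B = "So Long, and Thanks for All the Fish", D = "42")
cleaned <- clean_sketch(texts)

# remove articles that are empty after cleaning (here: D)
cleaned <- cleaned[lengths(cleaned) > 0]
print(cleaned)
# $B
# [1] "so"     "long"   "thanks" "all"    "fish"
```

Article `D` consists only of a number, so it ends up with no tokens and is dropped, which mirrors the removal of empty articles mentioned above.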

Value

A textmeta object or a list (if object is not specified) containing the preprocessed articles.
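
To see which kind of value is returned, one can compare a call with `object` against a call with `text` only. The sketch below assumes tosca is installed; the small corpus and the names `from_object` and `from_text` are invented for illustration.

```r
# Minimal sketch, assuming the tosca package is installed.
if (requireNamespace("tosca", quietly = TRUE)) {
  texts <- list(
    A = "Give a Man a Fish, and You Feed Him for a Day.",
    B = "So Long, and Thanks for All the Fish"
  )
  meta <- data.frame(
    id = c("A", "B"),
    title = c("Fishing", "Don't panic!"),
    date = as.Date(c("1885-01-02", "1979-03-04")),
    stringsAsFactors = FALSE
  )
  corpus <- tosca::textmeta(meta = meta, text = texts)

  from_object <- tosca::cleanTexts(object = corpus)  # object specified
  from_text <- tosca::cleanTexts(text = texts)       # only text specified

  print(class(from_object))
  print(class(from_text))
}
```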

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

cleanTexts(object=corpus)

texts <- list(A=c("Give a Man a Fish, and You Feed Him for a Day.",
"Teach a Man To Fish, and You Feed Him for a Lifetime"),
B="So Long, and Thanks for All the Fish",
C=c("A very able manipulative mathematician,",
"Fisher enjoys a real mastery in evaluating complicated multiple integrals."))

cleanTexts(text=texts, sw = "en", paragraph = TRUE)

[Package tosca version 0.3-2 Index]