cleanTexts {tosca}                                        R Documentation

Data Preprocessing

Description

Removes punctuation, numbers and stopwords, converts letters to lowercase and tokenizes the texts.

Usage

cleanTexts(
  object,
  text,
  sw = "en",
  paragraph = FALSE,
  lowercase = TRUE,
  rmPunctuation = TRUE,
  rmNumbers = TRUE,
  checkUTF8 = TRUE,
  ucp = TRUE
)

Arguments

object

textmeta object

text

Not necessary if object is specified; otherwise it should be object$text: a list of article texts.

sw

Character: Vector of stopwords. If the vector has length one, sw is interpreted as the argument for stopwords from the tm package.

paragraph

Logical: Should be set to TRUE if one article is a list of character strings, representing the paragraphs.

lowercase

Logical: Should be set to TRUE if all letters should be coerced to lowercase.

rmPunctuation

Logical: Should be set to TRUE if punctuation should be removed from articles.

rmNumbers

Logical: Should be set to TRUE if numbers should be removed from articles.

checkUTF8

Logical: Should be set to TRUE if articles should be checked for UTF-8 encoding, which is the package standard.

ucp

Logical: ucp option for removePunctuation from the tm package. Punctuation removal is run twice (ASCII and Unicode).

Details

Removes punctuation, numbers and stopwords, converts letters to lowercase and tokenizes the texts. Additionally, some cleaning steps are applied: empty words, paragraphs and articles are removed.
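
The steps listed above can be pictured with a minimal base-R sketch. This is an illustration only and not the tosca implementation: the function name sketch_clean and its short stopword vector are hypothetical, punctuation and digits are replaced by spaces as a simplification, and tokenization is assumed to be a plain whitespace split.

```r
# Illustrative sketch only; NOT how cleanTexts is implemented in tosca.
# `sketch_clean` and the default stopword vector are hypothetical.
sketch_clean <- function(text, sw = c("a", "and", "for", "the")) {
  out <- lapply(text, function(article) {
    x <- tolower(article)                          # lowercase
    x <- gsub("[[:punct:]]+", " ", x)              # replace punctuation by spaces
    x <- gsub("[[:digit:]]+", " ", x)              # replace numbers by spaces
    tokens <- unlist(strsplit(x, "[[:space:]]+"))  # tokenize on whitespace
    tokens <- tokens[nzchar(tokens)]               # drop empty words
    tokens[!tokens %in% sw]                        # drop stopwords
  })
  out[lengths(out) > 0]                            # drop empty articles
}

res <- sketch_clean(list(B = "So Long, and Thanks for All the Fish"))
res$B

# An article consisting only of stopwords, punctuation and numbers is dropped.
empty <- sketch_clean(list(X = "a, and 12"))
length(empty)
```

In this sketch every article ends up as a single character vector of tokens; paragraph structure (see the paragraph argument) is not preserved.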

Value

A textmeta object or a list (if object is not specified) containing the preprocessed articles.

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

cleanTexts(object=corpus)

texts <- list(A=c("Give a Man a Fish, and You Feed Him for a Day.",
"Teach a Man To Fish, and You Feed Him for a Lifetime"),
B="So Long, and Thanks for All the Fish",
C=c("A very able manipulative mathematician,",
"Fisher enjoys a real mastery in evaluating complicated multiple integrals."))

cleanTexts(text=texts, sw = "en", paragraph = TRUE)


[Package tosca version 0.3-2 Index]