cleanTexts {tosca} | R Documentation |
Data Preprocessing
Description
Removes punctuation, numbers and stopwords, converts letters to lowercase and tokenizes the texts.
Usage
cleanTexts(
object,
text,
sw = "en",
paragraph = FALSE,
lowercase = TRUE,
rmPunctuation = TRUE,
rmNumbers = TRUE,
checkUTF8 = TRUE,
ucp = TRUE
)
Arguments
object |
|
text |
Not necessary if |
sw |
Character: Vector of stopwords. If the vector is of length
one, |
paragraph |
Logical: Should be set to |
lowercase |
Logical: Should be set to |
rmPunctuation |
Logical: Should be set to |
rmNumbers |
Logical: Should be set to |
checkUTF8 |
Logical: Should be set to |
ucp |
Logical: ucp option for |
Details
Removes punctuation, numbers and stopwords, converts letters to lowercase and tokenizes the texts. Some additional cleaning steps are applied: empty words, paragraphs and articles are removed.
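The steps listed above can be illustrated with a minimal base-R sketch. This is not the tosca implementation: the function name clean_sketch, its arguments and the short stopword vector are hypothetical, and the order of the steps is an assumption made for illustration only.

```r
# Illustrative sketch only; NOT how cleanTexts is implemented in tosca.
# clean_sketch and its arguments are hypothetical names.
clean_sketch <- function(text, stopwords = character(0)) {
  lapply(text, function(x) {
    x <- tolower(x)                               # convert to lowercase
    x <- gsub("[[:punct:]]", "", x)               # remove punctuation
    x <- gsub("[[:digit:]]", "", x)               # remove numbers
    tokens <- unlist(strsplit(x, "[[:space:]]+")) # tokenize on whitespace
    tokens <- tokens[nzchar(tokens)]              # drop empty words
    tokens[!tokens %in% stopwords]                # drop stopwords
  })
}

texts <- list(B = "So Long, and Thanks for All the Fish",
              E = "42 !!!")
res <- clean_sketch(texts, stopwords = c("so", "and", "for", "all", "the"))
res <- res[lengths(res) > 0]                      # drop empty articles
print(res)
```

Article E consists only of numbers and punctuation, so nothing is left after cleaning and it is dropped, which mirrors the removal of empty articles described above.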
Value
A textmeta object, or a list (if object is not specified), containing the preprocessed articles.
Examples
library(tosca)

# Example 1: one character string per article, passed as a textmeta object
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
  B="So Long, and Thanks for All the Fish",
  C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "B", "C", "D"),
  title=c("Fishing", "Don't panic!", "Sir Ronald", "Berlin"),
  date=c("1885-01-02", "1979-03-04", "1951-05-06", "1967-06-02"),
  additionalVariable=1:4, stringsAsFactors=FALSE), text=texts)

cleanTexts(object=corpus)

# Example 2: articles split into paragraphs, passed as a plain list
texts <- list(A=c("Give a Man a Fish, and You Feed Him for a Day.",
    "Teach a Man To Fish, and You Feed Him for a Lifetime"),
  B="So Long, and Thanks for All the Fish",
  C=c("A very able manipulative mathematician,",
    "Fisher enjoys a real mastery in evaluating complicated multiple integrals."))

cleanTexts(text=texts, sw = "en", paragraph = TRUE)