sento_corpus {sentometrics} | R Documentation
Create a sento_corpus object
Description
Formalizes a collection of texts into a sento_corpus object derived from the quanteda corpus object. The quanteda package provides a robust text mining infrastructure (see their website), including a handy corpus manipulation toolset. This function performs a set of checks on the input data and prepares the corpus for further analysis by structurally integrating a date dimension and numeric metadata features.
Usage
sento_corpus(corpusdf, do.clean = FALSE)
Arguments
corpusdf |
a |
do.clean |
a |
Details
A sento_corpus object is a specialized instance of a quanteda corpus. Any quanteda function applicable to its corpus object can also be applied to a sento_corpus object. However, changing a given sento_corpus object too drastically using some of quanteda's functions might alter the very structure the corpus needs (as defined in the corpusdf argument) to serve as an input to other functions of the sentometrics package. Some functions, including corpus_sample or corpus_subset, do not change the actual corpus structure and may come in handy.
To add additional features, use add_features. Binary features are useful as a mechanism to select the texts that have to be integrated in the respective feature-based sentiment measure(s), but this selection applies only when do.ignoreZeros = TRUE. Because of this (implicit) selection, having complementary features (e.g., "economy" and "noneconomy") makes sense.
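As a minimal sketch of what complementary binary features can look like, the snippet below builds a small, entirely hypothetical input data.frame and derives "noneconomy" as the complement of "economy". The column names id, date and texts are assumed to follow the layout of the usnews data used in the Examples; the texts and values are made up for illustration. The call to sento_corpus is guarded so that the snippet also runs when sentometrics is not installed.

```r
# Hypothetical toy input; the id/date/texts column layout is assumed to
# mirror the usnews data set, and all values are invented for illustration.
corpusdf <- data.frame(
  id = c("d1", "d2", "d3", "d4"),
  date = c("2020-01-03", "2020-01-01", "2020-01-02", "2020-01-04"),
  texts = c("Markets rallied strongly.", "The team lost again.",
            "Inflation worries persist.", "A quiet day in parliament."),
  economy = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Complementary binary feature: each text belongs to exactly one of the two.
corpusdf$noneconomy <- 1 - corpusdf$economy
stopifnot(all(corpusdf$economy + corpusdf$noneconomy == 1))

# Build the corpus only when sentometrics is available.
if (requireNamespace("sentometrics", quietly = TRUE)) {
  corp <- sentometrics::sento_corpus(corpusdf = corpusdf)
  print(quanteda::docvars(corp))
}
```

With such a pair, every text contributes to exactly one of the two feature-based measures once zeros are ignored, which is the implicit selection described above.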
It is also possible to add one non-numerical feature, that is, "language", to designate the language of the corpus texts. When this feature is provided, a list of lexicons for different languages is expected in the compute_sentiment function.
Value
A sento_corpus object, derived from a quanteda corpus object. The corpus is ordered by date.
Author(s)
Samuel Borms
See Also
Examples
data("usnews", package = "sentometrics")
# corpus construction
corp <- sento_corpus(corpusdf = usnews)
# take a random subset making use of quanteda
corpusSmall <- quanteda::corpus_sample(corp, size = 500)
# deleting a feature
quanteda::docvars(corp, field = "wapo") <- NULL
# deleting all features results in the addition of a dummy feature
quanteda::docvars(corp, field = c("economy", "noneconomy", "wsj")) <- NULL
## Not run:
# to add or replace features, use the add_features() function...
quanteda::docvars(corp, field = c("wsj", "new")) <- 1
## End(Not run)
# corpus creation when no features are present
corpusDummy <- sento_corpus(corpusdf = usnews[, 1:3])
# corpus creation with a qualitative language feature
usnews[["language"]] <- "en"
usnews[["language"]][200:400] <- "nl"
corpusLang <- sento_corpus(corpusdf = usnews)