compute_sentiment {sentometrics} | R Documentation |
Compute textual sentiment across features and lexicons
Description
Given a corpus of texts, computes sentiment per document or sentence using the valence shifting augmented bag-of-words approach, based on the lexicons provided and a choice of aggregation across words.
Usage
compute_sentiment(
  x,
  lexicons,
  how = "proportional",
  tokens = NULL,
  do.sentence = FALSE,
  nCore = 1
)
Arguments
x |
either a |
lexicons |
a |
how |
a single |
tokens |
a |
do.sentence |
a |
nCore |
a positive |
Details
For a separate calculation of positive (resp. negative) sentiment, provide distinct positive (resp.
negative) lexicons (see the do.split option in the sento_lexicons function). All NAs are converted to 0,
under the assumption that this is equivalent to no sentiment. By default, tokens = NULL, meaning the corpus
is internally tokenized as unigrams, with punctuation and numbers but not stopwords removed. All tokens are
converted to lowercase, in line with what the sento_lexicons function does for the lexicons and valence
shifters. Word counts are based on that same tokenization.
Value
If x is a sento_corpus object: a sentiment object, i.e., a data.table of sentiment scores with an "id",
a "date" and a "word_count" column, and all lexicon-feature sentiment scores columns. The tokenized
sentences are not provided but can be obtained as stringi::stri_split_boundaries(texts, type = "sentence").
A sentiment object can be aggregated (into time series) with the aggregate.sentiment function.
If x is a quanteda corpus object: a sentiment scores data.table with an "id" and a "word_count" column,
and all lexicon-feature sentiment scores columns.
If x is a tm SimpleCorpus object, a tm VCorpus object, or a character vector: a sentiment scores data.table
with an auto-created "id" column, a "word_count" column, and all lexicon sentiment scores columns.
When do.sentence = TRUE, an additional "sentence_id" column is added alongside the "id" column.
Calculation
If the lexicons argument has no "valence" element, the sentiment computed corresponds to simple unigram
matching with the lexicons [unigrams approach]. If valence shifters are included in lexicons with a
corresponding "y" column, the polarity of a word detected from a lexicon is multiplied by the associated
value of a valence shifter if it appears right before the detected word (examples: not good or can't defend)
[bigrams approach]. If the valence table contains a "t" column, valence shifters are searched for in a
cluster centered around a detected polarity word [clusters approach]. The latter approach is a simplified
version of the one used by the sentimentr package. A cluster amounts to four words before and two words
after a polarity word. A cluster never overlaps with a preceding one. Roughly speaking, the polarity of a
cluster is calculated as n(1 + 0.80d)S + \sum s. The polarity score of the detected word is S, s represents
the polarities of any other sentiment words in the cluster, and d is the difference between the number of
amplifiers (t = 2) and the number of deamplifiers (t = 3). If there is an odd number of negators (t = 1),
n = -1 and amplifiers are counted as deamplifiers, else n = 1.
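As a rough illustration of the cluster formula above, the base R sketch below evaluates n(1 + 0.80d)S + \sum s for a single cluster. The function cluster_polarity and its argument names are hypothetical and not part of sentometrics. The handling of "amplifiers are counted as deamplifiers" (adding the amplifier count to the deamplifier count when the number of negators is odd) is one reading of the description, not a statement of how the package implements it.

```r
# Hypothetical helper (not a sentometrics function): polarity of one cluster,
# following the rough formula n(1 + 0.80d)S + sum(s).
#   S             polarity score of the detected word
#   shifter_types "t" values of the valence shifters found in the cluster
#                 (1 = negator, 2 = amplifier, 3 = deamplifier)
#   s             polarities of any other sentiment words in the cluster
cluster_polarity <- function(S, shifter_types = integer(0), s = numeric(0)) {
  n_neg   <- sum(shifter_types == 1)
  n_amp   <- sum(shifter_types == 2)
  n_deamp <- sum(shifter_types == 3)

  if (n_neg %% 2 == 1) {
    n <- -1
    # Assumed reading: with an odd number of negators, every amplifier
    # is counted as a deamplifier.
    n_deamp <- n_deamp + n_amp
    n_amp   <- 0
  } else {
    n <- 1
  }

  d <- n_amp - n_deamp
  n * (1 + 0.80 * d) * S + sum(s)
}

cluster_polarity(S = 1, shifter_types = 2)          # (1 + 0.80) * 1
cluster_polarity(S = 1, shifter_types = c(1, 2))    # -1 * (1 - 0.80) * 1
cluster_polarity(S = 1, shifter_types = 1, s = 0.5) # -1 * 1 + 0.5
```

The three calls cover an amplified word, a negated and amplified word, and a negated word accompanied by a second sentiment word.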
The sentence-level sentiment calculation approaches each sentence as if it is a document. Depending on the
input, either the unigrams, bigrams or clusters approach is used. The clusters approach is enhanced here to
follow the default sentimentr settings more closely. It uses a cluster of five words before and two words
after a polarized word. The cluster is limited to the words after a previous comma and before a next comma.
Adversative conjunctions (t = 4) are accounted for here. The cluster is reweighted based on the value
1 + 0.25adv, where adv is the difference between the number of adversative conjunctions found before and
after the polarized word.
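To make the reweighting value concrete, the base R sketch below computes 1 + 0.25adv for one polarized word in a tokenized sentence. The function adversative_weight and the small set of adversative conjunctions are hypothetical and used for illustration only. The sketch ignores the comma boundaries and the cluster window, and it only returns the weight, since how the weight is applied to the cluster is not spelled out above.

```r
# Hypothetical helper (not a sentometrics function): weight 1 + 0.25 * adv,
# where adv = (adversative conjunctions before the polarized word)
#           - (adversative conjunctions after the polarized word).
adversative_weight <- function(tokens, pos,
                               adversatives = c("but", "however", "although")) {
  stopifnot(pos >= 1, pos <= length(tokens))
  before <- tokens[seq_len(pos - 1)]  # tokens preceding the polarized word
  after  <- tokens[-seq_len(pos)]     # tokens following the polarized word
  adv <- sum(before %in% adversatives) - sum(after %in% adversatives)
  1 + 0.25 * adv
}

tokens <- c("the", "plan", "is", "risky", "but", "profitable")
adversative_weight(tokens, pos = 6)  # "but" precedes "profitable": adv = 1
adversative_weight(tokens, pos = 4)  # "but" follows "risky": adv = -1
```

In this toy sentence the conjunction raises the weight of the word that follows it and lowers the weight of the word that precedes it.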
Author(s)
Samuel Borms, Jeroen Van Pelt, Andres Algaba
Examples
data("usnews", package = "sentometrics")
txt <- system.file("texts", "txt", package = "tm")
reuters <- system.file("texts", "crude", package = "tm")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
l2 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                     list_valence_shifters[["en"]])
l3 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                     list_valence_shifters[["en"]][, c("x", "t")])
# from a sento_corpus object - unigrams approach
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 200)
sent1 <- compute_sentiment(corpusSample, l1, how = "proportionalPol")
# from a character vector - bigrams approach
sent2 <- compute_sentiment(usnews[["texts"]][1:200], l2, how = "counts")
# from a corpus object - clusters approach
corpusQ <- quanteda::corpus(usnews, text_field = "texts")
corpusQSample <- quanteda::corpus_sample(corpusQ, size = 200)
sent3 <- compute_sentiment(corpusQSample, l3, how = "counts")
# from an already tokenized corpus - using the 'tokens' argument
toks <- as.list(quanteda::tokens(corpusQSample, what = "fastestword"))
sent4 <- compute_sentiment(corpusQSample, l1[1], how = "counts", tokens = toks)
# from a SimpleCorpus object - unigrams approach
scorp <- tm::SimpleCorpus(tm::DirSource(txt))
sent5 <- compute_sentiment(scorp, l1, how = "proportional")
# from a VCorpus object - unigrams approach
## in contrast to what as.sento_corpus(vcorp) would do, the
## sentiment calculator handles multiple character vectors within
## a single corpus element as separate documents
vcorp <- tm::VCorpus(tm::DirSource(reuters))
sent6 <- compute_sentiment(vcorp, l1)
# from a sento_corpus object - unigrams approach with tf-idf weighting
sent7 <- compute_sentiment(corpusSample, l1, how = "TFIDF")
# sentence-by-sentence computation
sent8 <- compute_sentiment(corpusSample, l1, how = "proportionalSquareRoot",
                           do.sentence = TRUE)
# from a (fake) multilingual corpus
usnews[["language"]] <- "en" # add language column
usnews$language[1:100] <- "fr"
lEn <- sento_lexicons(list("FEEL_en" = list_lexicons$FEEL_en_tr,
                           "HENRY" = list_lexicons$HENRY_en),
                      list_valence_shifters$en)
lFr <- sento_lexicons(list("FEEL_fr" = list_lexicons$FEEL_fr),
                      list_valence_shifters$fr)
lexicons <- list(en = lEn, fr = lFr)
corpusLang <- sento_corpus(corpusdf = usnews[1:250, ])
sent9 <- compute_sentiment(corpusLang, lexicons, how = "proportional")