generateDictionary {SentimentAnalysis}    R Documentation
Generates dictionary of decisive terms
Description
This routine applies a method for dictionary generation (LASSO, ridge regularization, elastic net, ordinary least squares, generalized linear model or spike-and-slab regression) to the document-term matrix in order to extract decisive terms that have a statistically significant impact on the response variable.
Usage
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'Corpus'
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'character'
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'data.frame'
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'TermDocumentMatrix'
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'DocumentTermMatrix'
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)
Arguments
x: A vector of characters, a tm Corpus, a data.frame, a TermDocumentMatrix or a DocumentTermMatrix.

response: Response variable containing the given gold standard.

language: Language used for preprocessing operations (default: English).

modelType: A string denoting the estimation method. Allowed values are "lasso", "ridge", "enet", "lm", "glm" or "spikeslab" (default: "lasso").

filterTerms: Optional vector of strings (default: NULL) that restricts the dictionary generation to the given terms.

control: (optional) A list of parameters defining the model used for dictionary generation. For the regularization methods "lasso", "ridge" and "enet", entries such as s, the penalty at which coefficients are extracted, are passed on to the underlying estimation (e.g. control=list(s="lambda.min"), as in the examples below). For "spikeslab", the list is forwarded to the spike-and-slab estimation. For "lm" and "glm", no control parameters are required. Default is an empty list.

minWordLength: Removes words below a specific minimum length (default: 3). This preprocessing is applied when the input is a character vector or a corpus and the document-term matrix is generated inside the routine.

sparsity: A numeric for removing sparse terms from the document-term matrix; the argument specifies the maximal allowed sparsity (default: 0.9). This preprocessing is likewise only applied when the document-term matrix is generated inside the routine.

weighting: Weights a document-term matrix by e.g. term frequency - inverse document frequency (default). Other variants from the tm package, such as tm::weightTf, can be used instead.

...: Additional parameters passed to the preprocessing or estimation function, e.g. intercept=FALSE for the LASSO or family="poisson" for a generalized linear model (see the examples).
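When x is a character vector or a corpus, the preprocessing arguments above (minWordLength, sparsity, weighting) correspond to standard tm operations. A minimal sketch of this pipeline, assuming the tm package (an illustration of the effect of these arguments, not the package's internal code):

```r
library(tm)

documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")

# Build a corpus and a weighted document-term matrix
corpus <- VCorpus(VectorSource(documents))
dtm <- DocumentTermMatrix(
  corpus,
  control = list(
    tolower = TRUE,
    removePunctuation = TRUE,
    wordLengths = c(3, Inf),                                   # minWordLength = 3
    weighting = function(x) weightTfIdf(x, normalize = FALSE)  # default weighting
  )
)

# Drop terms whose sparsity exceeds the threshold
dtm <- removeSparseTerms(dtm, 0.9)                             # sparsity = 0.9
inspect(dtm)
```

The resulting matrix is what the estimation method (e.g. the LASSO) is then fitted on.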
Value
Result is an object of class SentimentDictionaryWeighted that contains the extracted terms with their estimated weights, together with the model intercept where applicable.
Source
doi:10.1371/journal.pone.0209323
References
Pröllochs and Feuerriegel (2018). Statistical inferences for polarity identification in natural language. PLoS ONE 13(12).
See Also
analyzeSentiment, predict.SentimentDictionaryWeighted, plot.SentimentDictionaryWeighted and compareToResponse for advanced evaluations
Examples
# Create a vector of strings
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)
# Generate dictionary with LASSO regularization
dictionary <- generateDictionary(documents, response)
# Show dictionary
dictionary
summary(dictionary)
plot(dictionary)
# Compute in-sample performance
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)
plotSentimentResponse(sentiment, response)
# Generate new dictionary with spike-and-slab regression instead of LASSO regularization
library(spikeslab)
dictionary <- generateDictionary(documents, response, modelType="spikeslab")
# Generate new dictionary with tf weighting instead of tf-idf
library(tm)
dictionary <- generateDictionary(documents, response, weighting=weightTf)
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)
# Use lambda.min from the LASSO estimation instead
dictionary <- generateDictionary(documents, response, control=list(s="lambda.min"))
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)
# Use OLS as the estimation method instead
dictionary <- generateDictionary(documents, response, modelType="lm")
sentiment <- predict(dictionary, documents)
sentiment
dictionary <- generateDictionary(documents, response, modelType="lm",
                                 filterTerms = c("good", "bad"))
sentiment <- predict(dictionary, documents)
sentiment
dictionary <- generateDictionary(documents, response, modelType="lm",
                                 filterTerms = extractWords(loadDictionaryGI()))
sentiment <- predict(dictionary, documents)
sentiment
# Generate dictionary without LASSO intercept
dictionary <- generateDictionary(documents, response, intercept=FALSE)
dictionary$intercept
## Not run:
imdb <- loadImdb()
# Generate Dictionary
dictionary_imdb <- generateDictionary(imdb$Corpus, imdb$Rating, family="poisson")
summary(dictionary_imdb)
compareDictionaries(dictionary_imdb,
                    loadDictionaryGI())
# Show estimated coefficients with Kernel Density Estimation (KDE)
plot(dictionary_imdb)
plot(dictionary_imdb) + xlim(c(-0.1, 0.1))
# Compute in-sample performance
pred_sentiment <- predict(dictionary_imdb, imdb$Corpus)
compareToResponse(pred_sentiment, imdb$Rating)
# Test a different sparsity parameter
dictionary_imdb <- generateDictionary(imdb$Corpus, imdb$Rating, family="poisson", sparsity=0.99)
summary(dictionary_imdb)
pred_sentiment <- predict(dictionary_imdb, imdb$Corpus)
compareToResponse(pred_sentiment, imdb$Rating)
## End(Not run)