pool_tweets {Twitmo}    R Documentation
Prepare Tweets for topic modeling by pooling
Description
This function pools a data frame of parsed tweets into longer pseudo-documents (document pools) for topic modeling.
Usage
pool_tweets(
data,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_url = TRUE,
remove_emojis = TRUE,
remove_users = TRUE,
remove_hashtags = TRUE,
cosine_threshold = 0.9,
stopwords = "en",
n_grams = 1L
)
Arguments
data
Data frame of parsed tweets, e.g. obtained with load_tweets() (see Examples).

remove_numbers
Logical. If TRUE, remove numbers from the tweets.

remove_punct
Logical. If TRUE, remove punctuation.

remove_symbols
Logical. If TRUE, remove symbols.

remove_url
Logical. If TRUE, remove URLs.

remove_emojis
Logical. If TRUE, remove emojis.

remove_users
Logical. If TRUE, remove user handles (@mentions).

remove_hashtags
Logical. If TRUE, remove hashtags.

cosine_threshold
Double. Value between 0 and 1 specifying the cosine similarity threshold used for document pooling. Tweets without a hashtag are assigned to document (hashtag) pools based on this metric (see the sketch following this list). Low thresholds pull a large number of unhashtagged tweets into the pools and can reduce topic coherence; higher thresholds lead to more coherent topics but smaller documents.

stopwords
A character vector, list of character vectors, dictionary, or collocations object specifying the stopwords to remove; see the pattern argument in quanteda for details. Defaults to "en" (English stopwords).

n_grams
Integer vector specifying the number of elements to be concatenated in each n-gram. Each element of this vector defines one n in the n-gram(s) produced. See quanteda::tokens_ngrams().
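To build intuition for cosine_threshold, here is a minimal sketch (not part of the Twitmo API; the document names and texts are invented) that computes the cosine similarity between a toy hashtag pool and a tweet without a hashtag using quanteda and quanteda.textstats:

library(quanteda)
library(quanteda.textstats)

# Hypothetical documents: one hashtag pool and one unhashtagged tweet
docs <- c(
  pool_nature = "hiking forest trail birds nature photography weekend",
  lone_tweet  = "beautiful forest hike and some bird photography today"
)

# Cosine similarity between the two bag-of-words vectors
dfmat <- dfm(tokens(docs))
sim <- textstat_simil(dfmat, method = "cosine")
as.matrix(sim)["lone_tweet", "pool_nature"]
# The tweet would only be merged into the pool if this value
# exceeded cosine_threshold.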
Details
Pools tweets into hashtag-based pseudo-documents, assigning tweets without a hashtag to the most similar pool via cosine similarity, so that LDA can be estimated on longer documents, and creates n-gram tokens. The method implements the pooling algorithm of Mehrotra et al. (2013).
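As a small illustration of the n-gram tokens mentioned above (assuming n_grams is handled analogously to quanteda's tokens_ngrams(); the toy text is invented):

library(quanteda)
toks <- tokens("pumpkin spice latte season")
tokens_ngrams(toks, n = 1L)   # unigrams: "pumpkin" "spice" "latte" "season"
tokens_ngrams(toks, n = 2L)   # bigrams:  "pumpkin_spice" "spice_latte" "latte_season"
tokens_ngrams(toks, n = 1:2)  # both unigrams and bigrams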
Value
A list containing a quanteda corpus object and a document-feature matrix (dfm) of the pooled tweets.
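The returned dfm can be passed on to a topic model. A minimal downstream sketch (not taken from this help page; extracting the dfm element by class and the choice of k are assumptions) using quanteda's converter and the topicmodels package:

library(quanteda)
library(topicmodels)

# Assuming `pool` is the list returned by pool_tweets(); pick out the dfm
# element by class, since the element names are not documented on this page.
pooled_dfm <- pool[[which(sapply(pool, inherits, "dfm"))[1]]]

lda_input <- convert(pooled_dfm, to = "topicmodels")
lda_model <- LDA(lda_input, k = 3, control = list(seed = 1234))
terms(lda_model, 5)  # top 5 terms per topic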
References
Mehrotra, Rishabh, Scott Sanner, Wray Buntine, and Lexing Xie (2013). "Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling." 889-892. doi:10.1145/2484028.2484166.
Examples
## Not run:
library(Twitmo)
# load tweets (included in package)
mytweets <- load_tweets(system.file("extdata", "tweets_20191027-141233.json", package = "Twitmo"))
pool <- pool_tweets(
  data = mytweets,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  remove_users = TRUE,
  remove_hashtags = TRUE,
  remove_emojis = TRUE,
  cosine_threshold = 0.9,
  stopwords = "en",
  n_grams = 1
)
## End(Not run)