pool_tweets {Twitmo}    R Documentation
Prepare Tweets for topic modeling by pooling
Description
This function pools a data frame of parsed tweets into longer pseudo-documents (document pools) for topic modeling.
Usage
pool_tweets(
data,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_url = TRUE,
remove_emojis = TRUE,
remove_users = TRUE,
remove_hashtags = TRUE,
cosine_threshold = 0.9,
stopwords = "en",
n_grams = 1L
)
Arguments
data
Data frame of parsed tweets, e.g. obtained with load_tweets() (see Examples).

remove_numbers
Logical. If TRUE, remove numbers from the tweets.

remove_punct
Logical. If TRUE, remove punctuation.

remove_symbols
Logical. If TRUE, remove symbols.

remove_url
Logical. If TRUE, remove URLs.

remove_emojis
Logical. If TRUE, remove emojis.

remove_users
Logical. If TRUE, remove user handles (@mentions).

remove_hashtags
Logical. If TRUE, remove hashtags.

cosine_threshold
Double. Value between 0 and 1 specifying the cosine similarity threshold used for document pooling. Tweets without a hashtag are assigned to document (hashtag) pools based on this metric (see the sketch following this list). Low thresholds pull a large number of unhashtagged tweets into the pools and can reduce topic coherence; higher thresholds lead to more coherent topics but smaller documents.

stopwords
A character vector, list of character vectors, dictionary, or collocations object specifying the stopwords to remove; see the pattern argument in quanteda for details. Defaults to "en" (English stopwords).

n_grams
Integer vector specifying the number of elements to be concatenated in each n-gram. Each element of this vector defines one n in the n-gram(s) produced. See quanteda::tokens_ngrams().
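To build intuition for cosine_threshold, here is a minimal sketch (not part of the Twitmo API; the document names and texts are invented) that computes the cosine similarity between a toy hashtag pool and a tweet without a hashtag using quanteda and quanteda.textstats:

library(quanteda)
library(quanteda.textstats)

# Hypothetical documents: one hashtag pool and one unhashtagged tweet
docs <- c(
  pool_nature = "hiking forest trail birds nature photography weekend",
  lone_tweet  = "beautiful forest hike and some bird photography today"
)

# Cosine similarity between the two bag-of-words vectors
dfmat <- dfm(tokens(docs))
sim <- textstat_simil(dfmat, method = "cosine")
as.matrix(sim)["lone_tweet", "pool_nature"]
# The tweet would only be merged into the pool if this value
# exceeded cosine_threshold.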
Details
Pools tweets into hashtag-based pseudo-documents, assigning tweets without a hashtag to the most similar pool via cosine similarity, so that LDA can be estimated on longer documents, and creates n-gram tokens. The method implements the pooling algorithm of Mehrotra et al. (2013).
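As a small illustration of the n-gram tokens mentioned above (assuming n_grams is handled analogously to quanteda's tokens_ngrams(); the toy text is invented):

library(quanteda)
toks <- tokens("pumpkin spice latte season")
tokens_ngrams(toks, n = 1L)   # unigrams: "pumpkin" "spice" "latte" "season"
tokens_ngrams(toks, n = 2L)   # bigrams:  "pumpkin_spice" "spice_latte" "latte_season"
tokens_ngrams(toks, n = 1:2)  # both unigrams and bigrams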
Value
A list containing a quanteda corpus object and a document-feature matrix (dfm) of the pooled tweets.
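The returned dfm can be passed on to a topic model. A minimal downstream sketch (not taken from this help page; extracting the dfm element by class and the choice of k are assumptions) using quanteda's converter and the topicmodels package:

library(quanteda)
library(topicmodels)

# Assuming `pool` is the list returned by pool_tweets(); pick out the dfm
# element by class, since the element names are not documented on this page.
pooled_dfm <- pool[[which(sapply(pool, inherits, "dfm"))[1]]]

lda_input <- convert(pooled_dfm, to = "topicmodels")
lda_model <- LDA(lda_input, k = 3, control = list(seed = 1234))
terms(lda_model, 5)  # top 5 terms per topic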
References
Mehrotra, Rishabh, Scott Sanner, Wray Buntine, and Lexing Xie (2013). "Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling." 889-892. doi:10.1145/2484028.2484166.
Examples
## Not run:
library(Twitmo)
# load tweets (included in package)
mytweets <- load_tweets(system.file("extdata", "tweets_20191027-141233.json", package = "Twitmo"))
pool <- pool_tweets(
  data = mytweets,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  remove_users = TRUE,
  remove_hashtags = TRUE,
  remove_emojis = TRUE,
  cosine_threshold = 0.9,
  stopwords = "en",
  n_grams = 1
)
## End(Not run)