| dataset_imdb {textdata} | R Documentation | 
IMDB Large Movie Review Dataset
Description
The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg).
Usage
dataset_imdb(
  dir = NULL,
  split = c("train", "test"),
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)
Arguments
| dir | Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path. | 
| split | Character. Return training ("train") data or testing ("test") data. Defaults to "train". | 
| delete | Logical, set TRUE to delete dataset. | 
| return_path | Logical, set TRUE to return the path of the dataset. | 
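| clean | Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE. | 
| manual_download | Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE. | 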
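The Usage block gives `split` a two-element default, while the argument table says it defaults to "train". A common R idiom that produces this behavior is `match.arg()`, which selects the first choice when the argument is left at its default. The sketch below illustrates the idiom with a hypothetical `resolve_split()` helper; that `dataset_imdb()` is implemented this way is an assumption, not something this page confirms.

```r
# Hypothetical helper, not part of textdata: shows how a vector default
# such as c("train", "test") can resolve to its first element.
resolve_split <- function(split = c("train", "test")) {
  # match.arg() returns the first choice when `split` is left at its
  # default and raises an error for values outside the allowed set.
  match.arg(split)
}

resolve_split()        # "train"
resolve_split("test")  # "test"
```

Under this idiom, a misspelled value such as `split = "dev"` fails immediately with an error instead of silently returning the wrong data.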
Details
In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included, and there are equal numbers of reviews with scores > 5 and <= 5.
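The labeling rule above can be sketched in a few lines of base R. The helper name `label_from_score()` is hypothetical, and the label strings "neg" and "pos" are assumed from the abbreviations used in the Description; the example scores are made up for illustration.

```r
# Illustrative helper, not part of textdata: maps a 1-10 rating to the
# label described above. Ratings of 5 or 6 are treated as neutral and
# get NA, mirroring their exclusion from the labeled train/test sets.
label_from_score <- function(score) {
  ifelse(score <= 4, "neg",
         ifelse(score >= 7, "pos", NA_character_))
}

# Made-up ratings covering both thresholds and the neutral band
scores <- c(1, 4, 5, 6, 7, 10)
labels <- label_from_score(scores)

# Drop neutral reviews, as the labeled sets do
labeled <- data.frame(score = scores, label = labels)[!is.na(labels), ]
labeled
```

Because ratings of 5 and 6 never receive a label, a classifier trained on the labeled sets only sees clearly polarized reviews.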
When using this dataset, please cite the ACL 2011 paper:
@InProceedings{maas-EtAl:2011:ACL-HLT2011, 
author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher}, 
title     = {Learning Word Vectors for Sentiment Analysis}, 
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, 
month     = {June}, 
year      = {2011}, 
address   = {Portland, Oregon, USA}, 
publisher = {Association for Computational Linguistics}, 
pages     = {142–150}, 
url       = {http://www.aclweb.org/anthology/P11-1015}
}
Value
A tibble with 25,000 rows and 2 variables:
- Sentiment: Character, denoting the sentiment
- text: Character, text of the review
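A quick sanity check on the returned object might look like the sketch below. It uses a small stand-in data frame so that it runs without downloading anything. The rows are made-up placeholders, and the lowercase column name `sentiment` and the values "pos" and "neg" are assumptions; call `names()` on the real object to confirm the capitalization.

```r
# Stand-in for the returned tibble; the rows are made-up placeholders,
# not actual reviews. Column names and label values are assumptions.
reviews <- data.frame(
  sentiment = c("pos", "neg", "neg", "pos"),
  text = c("placeholder review one", "placeholder review two",
           "placeholder review three", "placeholder review four"),
  stringsAsFactors = FALSE
)

# Count reviews per label; a balanced split has equal counts
label_counts <- table(reviews$sentiment)
label_counts

# Review length in characters, useful for spotting empty or truncated text
review_lengths <- nchar(reviews$text)
summary(review_lengths)
```

The same two checks apply unchanged to the real `train` and `test` objects once the dataset has been downloaded.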
Source
http://ai.stanford.edu/~amaas/data/sentiment/
Examples
## Not run: 
dataset_imdb()
# Custom directory
dataset_imdb(dir = "data/")
# Deleting dataset
dataset_imdb(delete = TRUE)
# Returning filepath of data
dataset_imdb(return_path = TRUE)
# Access both the training and testing datasets
train <- dataset_imdb(split = "train")
test <- dataset_imdb(split = "test")
## End(Not run)