imdb_dataset {torchdatasets} | R Documentation |
IMDB movie review sentiment classification dataset
Description
The format of this dataset is meant to replicate that provided by Keras.
Usage
imdb_dataset(
root,
download = FALSE,
split = "train",
shuffle = (split == "train"),
num_words = Inf,
skip_top = 0,
maxlen = Inf,
start_char = 2,
oov_char = 3,
index_from = 4
)
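A minimal sketch of a typical call, using only the arguments listed in the signature above. The cache directory and the 10000-word vocabulary limit are illustrative choices, not package defaults, and the block is guarded so that nothing is downloaded outside an interactive session.

```r
# Illustrative values only: the directory name and vocabulary size are assumptions.
root <- file.path(tempdir(), "imdb")
vocab_size <- 10000

if (interactive() && requireNamespace("torchdatasets", quietly = TRUE)) {
  # Training split: shuffle defaults to TRUE because split == "train".
  train_ds <- torchdatasets::imdb_dataset(
    root = root,
    download = TRUE,
    split = "train",
    num_words = vocab_size
  )

  # Test split: shuffle defaults to FALSE because split != "train".
  test_ds <- torchdatasets::imdb_dataset(
    root = root,
    download = TRUE,
    split = "test",
    num_words = vocab_size
  )
}
```

Using the same num_words for every split keeps the integer vocabulary consistent between training and evaluation.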
Arguments
root |
path to the data location |
download |
whether to download the dataset |
split |
which split to load: "train", "test" or "valid" |
shuffle |
whether to shuffle the dataset. Defaults to TRUE for the "train" split and FALSE otherwise. |
num_words |
Words are ranked by how often they occur (in the training set),
and only the num_words most frequent words are kept. Any less frequent word
will appear as the oov_char value in the sequence data. If |
skip_top |
skip the top N most frequently occurring words (which may not be informative). These words will appear as the oov_char value in the dataset. Defaults to 0, so no words are skipped. |
maxlen |
int or |
start_char |
The start of a sequence will be marked with this character. Defaults to 2, because 1 is usually the padding character. |
oov_char |
int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character. |
index_from |
int. Index actual words with this index and higher. |
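The interaction between num_words, skip_top, start_char, oov_char and index_from can be sketched in base R. The helper encode_ranks below is hypothetical and is not part of torchdatasets. It follows one literal reading of the argument descriptions above: limits are applied to frequency ranks, and the most frequent word is mapped to index_from. The package's actual implementation may differ in these details.

```r
# Hypothetical helper, NOT the torchdatasets implementation.
# `ranks` holds the 1-based frequency rank of each word in a review
# (rank 1 = most frequent word in the training set).
encode_ranks <- function(ranks, num_words = Inf, skip_top = 0,
                         start_char = 2, oov_char = 3, index_from = 4) {
  # Keep a word only if it is outside the skipped top and inside the vocabulary.
  keep <- ranks > skip_top & ranks <= num_words
  # Assumption: rank 1 maps to index_from, rank 2 to index_from + 1, and so on.
  ids <- ifelse(keep, ranks + index_from - 1, oov_char)
  # Every sequence begins with the start marker.
  c(start_char, ids)
}

ranks <- c(1, 250, 12, 9000)

# Defaults: nothing is dropped.
encode_ranks(ranks)
# 2 4 253 15 9003

# Keep the 1000 most frequent words, but skip the 5 most frequent.
encode_ranks(ranks, num_words = 1000, skip_top = 5)
# 2 3 253 15 3
```

Under these assumptions, index 1 stays free for padding, 2 marks the start of a sequence, 3 stands in for every dropped word, and real words occupy 4 and above.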