dataset_trec {textdata} | R Documentation |
TREC dataset
Description
The TREC dataset is a dataset for question classification consisting of open-domain, fact-based questions divided into broad semantic categories. It comes in a six-class version (TREC-6) and a fifty-class version (TREC-50). Both have 5,452 training examples and 500 test examples, but TREC-50 has finer-grained labels. Models are evaluated on accuracy.
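Since models are scored on accuracy, a minimal sketch of that computation may help. The toy data frame and the `accuracy()` helper are illustrative assumptions and not part of textdata; only the `class` and `text` column names come from this page.

```r
# Toy stand-in for a TREC-6 test split (hypothetical rows, not real data).
toy_test <- data.frame(
  class = c("HUM", "LOC", "NUM", "HUM"),
  text  = c("Who wrote Hamlet ?",
            "Where is the Louvre ?",
            "How many moons does Mars have ?",
            "Who painted Guernica ?"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fraction of predictions that match the true labels.
accuracy <- function(truth, predicted) {
  stopifnot(length(truth) == length(predicted))
  mean(truth == predicted)
}

# Majority-class baseline: always predict the most frequent label.
majority <- names(which.max(table(toy_test$class)))
baseline_acc <- accuracy(toy_test$class, rep(majority, nrow(toy_test)))
print(baseline_acc)
```

On this toy data the majority class is "HUM", so the baseline gets 2 of 4 questions right.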
Usage
dataset_trec(
dir = NULL,
split = c("train", "test"),
version = c("6", "50"),
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
dir |
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path. |
split |
Character. Return training ("train") data or testing ("test") data. Defaults to "train". |
version |
Character. Version 6 ("6") or version 50 ("50"). Defaults to "6". |
delete |
Logical, set TRUE to delete dataset. |
return_path |
Logical, set TRUE to return the path of the dataset. |
clean |
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE. |
manual_download |
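Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE. |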
Details
The classes in TREC-6 are:
ABBR - Abbreviation
DESC - Description and abstract concepts
ENTY - Entities
HUM - Human beings
LOC - Locations
NUM - Numeric values
The classes in TREC-50 are listed at https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html.
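To attach readable names to the six coarse classes, a named character vector can serve as a lookup table. This is a sketch only: `trec6_labels` and `codes` are hypothetical names, and it assumes the `class` column holds exactly these upper-case codes, with the numeric class coded as NUM.

```r
# Descriptions of the six coarse classes, taken from the list above.
trec6_labels <- c(
  ABBR = "Abbreviation",
  DESC = "Description and abstract concepts",
  ENTY = "Entities",
  HUM  = "Human beings",
  LOC  = "Locations",
  NUM  = "Numeric values"
)

# Hypothetical class codes, standing in for the `class` column.
codes <- c("LOC", "HUM", "NUM")

# Look up the description for each code; unknown codes would give NA.
descriptions <- unname(trec6_labels[codes])
print(descriptions)
```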
Value
A tibble with 5,452 or 500 rows for "train" and "test" respectively and 2 variables:
- class
Character, denoting the class
- text
Character, question text
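The documented shape can be used as a quick sanity check after loading. The helper below, `check_trec_shape()`, is a hypothetical function written for illustration and is not provided by textdata; it is demonstrated on a toy data frame so that it runs without downloading anything.

```r
# Hypothetical helper: checks that `x` has the two documented columns and
# the row count documented for the requested split.
check_trec_shape <- function(x, split = c("train", "test")) {
  split <- match.arg(split)
  expected_rows <- c(train = 5452L, test = 500L)[[split]]
  identical(names(x), c("class", "text")) && nrow(x) == expected_rows
}

# A toy data frame with the right columns but too few rows fails the check.
toy <- data.frame(class = "LOC", text = "Where is the Louvre ?")
toy_ok <- check_trec_shape(toy, "test")
print(toy_ok)
```

With the real data, the same check would be applied to the result of dataset_trec(split = "test").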
Source
https://cogcomp.seas.upenn.edu/Data/QA/QC/
https://trec.nist.gov/data/qa.html
See Also
Other topic:
dataset_ag_news(), dataset_dbpedia()
Examples
## Not run:
dataset_trec()
# Custom directory
dataset_trec(dir = "data/")
# Deleting dataset
dataset_trec(delete = TRUE)
# Returning filepath of data
dataset_trec(return_path = TRUE)
# Access both training and testing dataset
train_6 <- dataset_trec(split = "train")
test_6 <- dataset_trec(split = "test")
train_50 <- dataset_trec(split = "train", version = "50")
test_50 <- dataset_trec(split = "test", version = "50")
## End(Not run)