R: TREC dataset

dataset_trec {textdata}

R Documentation

TREC dataset

Description

The TREC dataset is a dataset for question classification, consisting of open-domain, fact-based questions divided into broad semantic categories. It comes in a six-class version (TREC-6) and a fifty-class version (TREC-50). Both have 5,452 training examples and 500 test examples, but TREC-50 has finer-grained labels. Models are evaluated on accuracy.
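
Accuracy here is the fraction of questions whose predicted class matches the true class. A minimal sketch with made-up labels and predictions (neither vector comes from the dataset or from any real model):

```r
# Made-up true labels and predictions for five questions
truth <- c("LOC", "HUM", "NUM", "DESC", "ENTY")
predicted <- c("LOC", "HUM", "DESC", "DESC", "ENTY")

# Accuracy = share of positions where the prediction equals the label
accuracy <- mean(predicted == truth)
print(accuracy)
```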

Usage

dataset_trec(
  dir = NULL,
  split = c("train", "test"),
  version = c("6", "50"),
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)

Arguments

`dir`	Character, path to the directory where data will be stored. If `NULL`, `rappdirs::user_cache_dir()` will be used to determine the path.
`split`	Character. Return training ("train") data or testing ("test") data. Defaults to "train".
`version`	Character. Version 6 ("6") or version 50 ("50"). Defaults to "6".
`delete`	Logical, set `TRUE` to delete dataset.
`return_path`	Logical, set `TRUE` to return the path of the dataset.
`clean`	Logical, set `TRUE` to remove intermediate files. This can greatly reduce the size of the stored dataset. Defaults to `FALSE`.
`manual_download`	Logical, set `TRUE` if you have manually downloaded the file and placed it in the folder designated by running this function with `return_path = TRUE`.

Details

The classes in TREC-6 are:

ABBR - Abbreviation
DESC - Description and abstract concepts
ENTY - Entities
HUM - Human beings
LOC - Locations
NUM - Numeric values

The classes in TREC-50 are listed at https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html.
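
The six coarse labels can be mapped to their descriptions with a named character vector. The following is a minimal sketch in base R. `trec6_labels` and the sample `classes` vector are illustrative names, not part of textdata, and the numeric class is written as `NUM`, the label used in the upstream definitions.

```r
# Hypothetical lookup table for the TREC-6 classes;
# `trec6_labels` is not exported by textdata.
trec6_labels <- c(
  ABBR = "Abbreviation",
  DESC = "Description and abstract concepts",
  ENTY = "Entities",
  HUM  = "Human beings",
  LOC  = "Locations",
  NUM  = "Numeric values"
)

# Illustrative class codes, standing in for the `class` column
classes <- c("HUM", "LOC", "NUM")

# Translate codes to readable descriptions
descriptions <- unname(trec6_labels[classes])
print(descriptions)
```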

Value

A tibble with 5,452 rows for "train" or 500 rows for "test", and 2 variables:

class: Character, denoting the class
text: Character, question text
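
To make the return shape concrete, the sketch below builds a small stand-in data frame with the same two columns and counts questions per class. The rows are invented for illustration and are not taken from the dataset. With the real data, `trec_sample` would be replaced by the result of `dataset_trec()`.

```r
# Stand-in for the returned tibble: same column names and types,
# but the rows are made up for illustration.
trec_sample <- data.frame(
  class = c("LOC", "HUM", "NUM", "LOC"),
  text = c(
    "Where is the Eiffel Tower ?",
    "Who wrote Hamlet ?",
    "How many days are in a leap year ?",
    "What country is Kyoto in ?"
  ),
  stringsAsFactors = FALSE
)

# Count questions per class
class_counts <- table(trec_sample$class)
print(class_counts)
```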

Source

https://cogcomp.seas.upenn.edu/Data/QA/QC/

https://trec.nist.gov/data/qa.html

Examples

## Not run: 
dataset_trec()

# Custom directory
dataset_trec(dir = "data/")

# Deleting dataset
dataset_trec(delete = TRUE)

# Returning filepath of data
dataset_trec(return_path = TRUE)

# Access both training and testing dataset
train_6 <- dataset_trec(split = "train")
test_6 <- dataset_trec(split = "test")

train_50 <- dataset_trec(split = "train", version = "50")
test_50 <- dataset_trec(split = "test", version = "50")

## End(Not run)

[Package textdata version 0.4.5 Index]