dataset_trec {textdata} R Documentation

TREC dataset

Description

The TREC dataset is a dataset for question classification consisting of open-domain, fact-based questions divided into broad semantic categories. It comes in a six-class (TREC-6) and a fifty-class (TREC-50) version. Both contain 5,452 training examples and 500 test examples; TREC-50 uses finer-grained labels. Models are evaluated on accuracy.
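Since accuracy is the evaluation metric, a minimal sketch of how it could be computed may help. The vectors below are made-up stand-ins, and the label strings are an assumption about how classes are encoded. In practice `truth` would be the `class` column of the test split and `predicted` would come from a model.

```r
# Stand-in labels (hypothetical): replace `truth` with the `class` column of
# dataset_trec(split = "test") and `predicted` with a model's output.
truth     <- c("LOC", "HUM", "NUM", "DESC", "ENTY")
predicted <- c("LOC", "HUM", "ENTY", "DESC", "ENTY")

# Accuracy is the share of questions whose predicted class equals the label.
accuracy <- mean(predicted == truth)
accuracy
```

The same expression works unchanged for TREC-6 and TREC-50, because it only compares two character vectors of equal length.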

Usage

dataset_trec(
  dir = NULL,
  split = c("train", "test"),
  version = c("6", "50"),
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)

Arguments

dir

Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.

split

Character. Return training ("train") data or testing ("test") data. Defaults to "train".

version

Character. Version 6 ("6") or version 50 ("50"). Defaults to "6".

delete

Logical, set TRUE to delete dataset.

return_path

Logical, set TRUE to return the path of the dataset.

clean

Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.

manual_download

Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.

Details

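The classes in TREC-6 are

ABBR: Abbreviation
DESC: Description and abstract concepts
ENTY: Entities
HUM: Human beings
LOC: Locations
NUM: Numeric values

The classes in TREC-50 can be found at https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html.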

Value

A tibble with 2 variables and 5,452 rows for "train" or 500 rows for "test":

class

Character, denoting the class

text

Character, question text
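As a rough illustration of working with the returned structure, the sketch below counts questions per class. It uses a small hand-made data frame with the same two columns instead of the downloaded data. The question texts and label strings are invented for illustration and are not taken from the dataset.

```r
# Hypothetical stand-in with the documented columns `class` and `text`.
# With the real data this would be: trec_like <- dataset_trec(split = "train")
trec_like <- data.frame(
  class = c("LOC", "HUM", "LOC", "NUM"),
  text = c(
    "Where is the tallest mountain ?",
    "Who wrote the first dictionary ?",
    "What country borders the most others ?",
    "How many moons does Mars have ?"
  ),
  stringsAsFactors = FALSE
)

# Number of questions per class, useful for checking class balance.
class_counts <- table(trec_like$class)
class_counts
```

Because a tibble is also a data frame, `table()` on the `class` column should behave the same way on the real return value.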

Source

https://cogcomp.seas.upenn.edu/Data/QA/QC/

https://trec.nist.gov/data/qa.html

See Also

Other topic: dataset_ag_news(), dataset_dbpedia()

Examples

## Not run: 
dataset_trec()

# Custom directory
dataset_trec(dir = "data/")

# Deleting dataset
dataset_trec(delete = TRUE)

# Returning filepath of data
dataset_trec(return_path = TRUE)

# Access both training and testing dataset
train_6 <- dataset_trec(split = "train")
test_6 <- dataset_trec(split = "test")

train_50 <- dataset_trec(split = "train", version = "50")
test_50 <- dataset_trec(split = "test", version = "50")

## End(Not run)


[Package textdata version 0.4.4 Index]