lgb.Dataset {lightgbm}R Documentation

Construct lgb.Dataset object

Description

LightGBM does not train on raw data. It discretizes continuous features into histogram bins, tries to combine categorical features, and automatically handles missing and

The Dataset class handles that preprocessing, and holds that alternative representation of the input data.

Usage

lgb.Dataset(
  data,
  params = list(),
  reference = NULL,
  colnames = NULL,
  categorical_feature = NULL,
  free_raw_data = TRUE,
  label = NULL,
  weight = NULL,
  group = NULL,
  init_score = NULL
)

Arguments

data

a matrix object, a dgCMatrix object, a character representing a path to a text file (CSV, TSV, or LibSVM), or a character representing a path to a binary lgb.Dataset file

params

a list of parameters. See The "Dataset Parameters" section of the documentation for a list of parameters and valid values.

reference

reference dataset. When LightGBM creates a Dataset, it does some preprocessing like binning continuous features into histograms. If you want to apply the same bin boundaries from an existing dataset to new data, pass that existing Dataset to this argument.

colnames

names of columns

categorical_feature

categorical features. This can either be a character vector of feature names or an integer vector with the indices of the features (e.g. c(1L, 10L) to say "the first and tenth columns").

free_raw_data

LightGBM constructs its data format, called a "Dataset", from tabular data. By default, that Dataset object on the R side does not keep a copy of the raw data. This reduces LightGBM's memory consumption, but it means that the Dataset object cannot be changed after it has been constructed. If you'd prefer to be able to change the Dataset object after construction, set free_raw_data = FALSE.

label

vector of labels to use as the target variable

weight

numeric vector of sample weights

group

used for learning-to-rank tasks. An integer vector describing how to group rows together as ordered results from the same set of candidate results to be ranked. For example, if you have a 100-document dataset with group = c(10, 20, 40, 10, 10, 10), that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.

init_score

initial score is the base prediction lightgbm will boost from

Value

constructed dataset

Examples




data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
data_file <- tempfile(fileext = ".data")
lgb.Dataset.save(dtrain, data_file)
dtrain <- lgb.Dataset(data_file)
lgb.Dataset.construct(dtrain)


[Package lightgbm version 4.5.0 Index]