gpb.grid.search.tune.parameters {gpboost} | R Documentation
Function for choosing tuning parameters
Description
Function that allows for choosing tuning parameters from a grid in a deterministic or random way using cross-validation or validation data sets.
Usage
gpb.grid.search.tune.parameters(param_grid, data, params = list(),
num_try_random = NULL, nrounds = 100L, gp_model = NULL,
line_search_step_length = FALSE, use_gp_model_for_validation = TRUE,
train_gp_model_cov_pars = TRUE, folds = NULL, nfold = 4L,
label = NULL, weight = NULL, obj = NULL, eval = NULL,
verbose_eval = 1L, stratified = TRUE, init_model = NULL,
colnames = NULL, categorical_feature = NULL,
early_stopping_rounds = NULL, callbacks = list(),
return_all_combinations = FALSE, ...)
Arguments
param_grid
list with candidate parameter values. Every entry is a vector of candidate values for one tuning parameter, and the entry names are the parameter names (see the example below).
data
a gpb.Dataset object containing the training data.
params
list with other parameters that are not tuned but held fixed.
num_try_random
number of parameter combinations that are sampled at random from the grid and evaluated. If NULL, all combinations in param_grid are evaluated (deterministic grid search).
nrounds
number of boosting iterations (= number of trees). This is the most important tuning parameter for boosting.
gp_model
A GPModel object that contains the random effects (Gaussian process and / or grouped random effects) model.
line_search_step_length
Boolean. If TRUE, a line search is done to find the optimal step length for every boosting update (see, e.g., Friedman 2001). This is then multiplied by the learning_rate.
use_gp_model_for_validation
Boolean. If TRUE, the gp_model (Gaussian process and / or random effects) is also used, in addition to the tree model, for calculating predictions on the validation data.
train_gp_model_cov_pars
Boolean. If TRUE, the covariance parameters of the gp_model (Gaussian process and / or random effects) are estimated in every boosting iteration; otherwise, they are held fixed.
folds
list with pre-defined validation / cross-validation folds; every element is a vector with the indices of the corresponding validation observations (see the example below). If provided, nfold is ignored.
nfold
the original dataset is randomly partitioned into nfold equal-size subsamples (used if folds is not provided).
label
Vector of labels, used if data is not a gpb.Dataset object.
weight
vector of weights. If not NULL, it is set on the dataset.
obj
(character) The distribution of the response variable (= label) conditional on fixed and random effects. This only needs to be set when doing independent boosting without random effects / Gaussian processes.
eval
Evaluation metric to be monitored when doing CV and parameter tuning. This can be a string, a function, or a list with a mixture of strings and functions.
verbose_eval
integer. Controls how much information on the progress of the tuning is displayed (e.g., 0 = silent, 1 = summary results for every parameter combination).
stratified
a boolean indicating whether the sampling of folds should be stratified by the values of the outcome labels.
init_model
path of a model file or a gpb.Booster object; training will continue from this model.
colnames
feature names; if not NULL, these overwrite the names in the dataset.
categorical_feature
categorical features. This can either be a character vector of feature names or an integer vector with the indices of the features (e.g., c(1L, 10L) for the first and tenth columns).
early_stopping_rounds
int. Activates early stopping. Requires at least one validation data set and one metric. When this parameter is non-NULL, training stops if the evaluation of any metric on any validation set fails to improve for early_stopping_rounds consecutive boosting rounds.
callbacks
List of callback functions that are applied at each iteration.
return_all_combinations
a boolean. If TRUE, the results for all evaluated parameter combinations are returned in addition to the best one.
...
other parameters; see Parameters.rst for more information.
Value
A list with the best parameter combination and score. The list has the following format:
list("best_params" = best_params, "best_iter" = best_iter, "best_score" = best_score)
If return_all_combinations is TRUE, the list contains an additional entry 'all_combinations'.
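For illustration, a minimal sketch of how this returned list can be inspected (using the opt_params object from the Examples below):
opt_params$best_params   # named list with the chosen tuning parameters
opt_params$best_iter     # optimal number of boosting iterations
opt_params$best_score    # corresponding validation / cross-validation score
# with return_all_combinations = TRUE, additionally:
# opt_params$all_combinations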
Early Stopping
"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.
If multiple arguments are given to eval
, their order will be preserved. If you enable
early stopping by setting early_stopping_rounds
in params
, by default all
metrics will be considered for early stopping.
If you want to only consider the first metric for early stopping, pass
first_metric_only = TRUE
in params
. Note that if you also specify metric
in params
, that metric will be considered the "first" one. If you omit metric
,
a default metric will be used based on your choice for the parameter obj
(keyword argument)
or objective
(passed into params
).
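As a minimal sketch (using the parameter names described above; the metric name "l1" is just an illustrative choice), early stopping can be restricted to a single metric by passing, e.g.:
params <- list(learning_rate = 0.1, metric = "l1", first_metric_only = TRUE)
# 'metric' determines which metric counts as the "first" one; with
# first_metric_only = TRUE, only this metric is monitored for early stopping.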
Author(s)
Fabio Sigrist
Examples
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples
library(gpboost)
data(GPBoost_data, package = "gpboost")
# Create random effects model, dataset, and define parameter grid
gp_model <- GPModel(group_data = group_data[,1], likelihood="gaussian")
dataset <- gpb.Dataset(X, label = y)
param_grid <- list("learning_rate" = c(1,0.1,0.01),
"min_data_in_leaf" = c(10,100,1000),
"max_depth" = c(1,2,3,5,10),
"lambda_l2" = c(0,1,10))
other_params <- list(num_leaves = 2^10)
# Note: here we try different values for 'max_depth' and thus set 'num_leaves' to a large value.
# An alternative strategy is to impose no limit on 'max_depth',
# and try different values for 'num_leaves' as follows:
# param_grid = list("learning_rate" = c(1,0.1,0.01),
# "min_data_in_leaf" = c(10,100,1000),
# "num_leaves" = 2^(1:10),
# "lambda_l2" = c(0,1,10))
# other_params <- list(max_depth = -1)
set.seed(1)
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
num_try_random = NULL, nfold = 4,
data = dataset, gp_model = gp_model,
use_gp_model_for_validation = TRUE, verbose_eval = 1,
nrounds = 1000, early_stopping_rounds = 10)
print(paste0("Best parameters: ",
paste0(unlist(lapply(seq_along(opt_params$best_params),
function(y, n, i) { paste0(n[[i]],": ", y[[i]]) },
y=opt_params$best_params,
n=names(opt_params$best_params))), collapse=", ")))
print(paste0("Best number of iterations: ", opt_params$best_iter))
print(paste0("Best score: ", round(opt_params$best_score, digits=3)))
# Note: other scoring / evaluation metrics can be chosen using the
# 'metric' argument, e.g., metric = "l1"
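# A hedged sketch of a function-valued evaluation metric: the signature
# (preds, dtrain), the returned list(name, value, higher_better), and the use
# of getinfo() to access the labels follow the package's usual custom-metric
# convention and are assumptions here, not taken from this help page.
# my_mae <- function(preds, dtrain) {
#   labels <- getinfo(dtrain, "label")
#   list(name = "my_mae", value = mean(abs(labels - preds)), higher_better = FALSE)
# }
# It could then be passed via the 'eval' argument, e.g., eval = my_mae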
# Using manually defined validation data instead of cross-validation
valid_tune_idx <- sample.int(length(y), as.integer(0.2*length(y)))
folds <- list(valid_tune_idx)
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
num_try_random = NULL, folds = folds,
data = dataset, gp_model = gp_model,
use_gp_model_for_validation = TRUE, verbose_eval = 1,
nrounds = 1000, early_stopping_rounds = 10)
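# A hedged sketch of a random (rather than exhaustive) search over the same grid:
# with 'num_try_random' set, only that many randomly chosen parameter combinations
# are evaluated instead of the full grid (the value 10 is an arbitrary illustration).
set.seed(10)
opt_params_random <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                                     num_try_random = 10, nfold = 4,
                                                     data = dataset, gp_model = gp_model,
                                                     use_gp_model_for_validation = TRUE, verbose_eval = 1,
                                                     nrounds = 1000, early_stopping_rounds = 10)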