xgb.tuned {glmnetr} | R Documentation
Get a tuned XGBoost model fit
Description
This fits a gradient boosting machine model using the XGBoost platform. It uses the mlrMBO package to search for a well-fitting set of hyperparameters and will generally provide a better fit than xgb.simple(). Both this function and xgb.simple() require the data to be provided in an xgb.DMatrix() object. This object can be constructed with a command like data.full <- xgb.DMatrix( data=myxs, label=mylabel), where the myxs object contains the predictors (features) in a numerical matrix format with no missing values, and mylabel is the outcome or dependent variable. For logistic regression this would typically be a vector of 0's and 1's. For linear regression this would be a vector of numerical values. For a Cox proportional hazards model the label must be in the format required by XGBoost, which is different from that used by the survival or glmnet packages: a single vector is used where observations associated with an event are assigned the time of the event, and observations associated with censoring are assigned the NEGATIVE of the time of censoring. In this way information about time and status is communicated in a single vector instead of two vectors. The xgb.tuned() function does not handle (start,stop) time, i.e. interval, data.
To tune the xgboost model we use the mlrMBO package, which "suggests" the DiceKriging and rgenoud packages but does not install them. Still, for xgb.tuned() to run it seems that one should install the DiceKriging and rgenoud packages.
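As a small illustration of the Cox label encoding described above (a sketch only; mytime, mystatus and myxs are hypothetical example objects, not part of the package):
mytime   <- c(2.3, 1.1, 4.8, 0.7)   # follow-up times
mystatus <- c(1, 0, 1, 0)           # 1 = event, 0 = censored
mylabel  <- ifelse(mystatus == 1, mytime, -mytime)  # negative time marks censoring
# data.full <- xgboost::xgb.DMatrix(data=myxs, label=mylabel)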
Usage
xgb.tuned(
  train.xgb.dat,
  booster = "gbtree",
  objective = "survival:cox",
  eval_metric = NULL,
  minimize = NULL,
  seed = NULL,
  folds = NULL,
  doxgb = NULL,
  track = 0
)
Arguments
train.xgb.dat: The data to be used for training the XGBoost model.
booster: for now just "gbtree" (default).
objective: one of "survival:cox" (default), "binary:logistic" or "reg:squarederror".
eval_metric: one of "cox-nloglik" (default), "auc" or "rmse".
minimize: whether the eval_metric is to be minimized or maximized.
seed: a seed for set.seed() to ensure one can get the same results twice. If NULL the program will generate a random seed. Whether specified or NULL, the seed is stored in the output object for future reference.
folds: an optional list where each element is a vector of indices for a test fold. Default is NULL. If specified, then nfold is ignored, as in xgb.cv().
doxgb: a list specifying how the program is to do the xgb tune and fit. The list can have the elements $nfold, $nrounds and $early_stopping_rounds, each a numerical value of length 1; $folds, a list as used by xgb.cv() to identify folds for cross validation; and $eta, $gamma, $max_depth, $min_child_weight, $colsample_bytree, $lambda, $alpha and $subsample, each a numeric of length 2 giving the lower and upper values for the respective tuning parameter. The meaning of these terms is as in 'xgboost' xgb.train(). If not provided, defaults will be used. Defaults can be seen in the $doxgb element of the output object, again a list. When the seed and folds arguments are not NULL, their values override the $seed and $folds list elements. A sketch of such a list is given after this table.
track: 0 (default) to not track progress, 2 to track progress.
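As an illustration, a sketch of a doxgb list with the elements described above; the numerical values here are arbitrary choices for illustration, not the package defaults:
mydoxgb = list(nfold = 5, nrounds = 1000, early_stopping_rounds = 100,
               eta = c(0.01, 0.3),          # lower and upper tuning bounds
               gamma = c(0, 10),
               max_depth = c(2, 10),
               min_child_weight = c(1, 10),
               colsample_bytree = c(0.5, 1),
               lambda = c(0.1, 10),
               alpha = c(0.1, 10),
               subsample = c(0.5, 1))
# xgbfit = xgb.tuned(data.full, objective="survival:cox", doxgb=mydoxgb)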
Value
a tuned XGBoost model fit
Author(s)
Walter K Kremers with contributions from Nicholas B Larson
See Also
xgb.simple, rederive_xgb, nested.glmnetr
Examples
# Simulate some data for a Cox model
sim.data=glmnetr.simdata(nrows=1000, ncols=100, beta=NULL)
Surv.xgb = ifelse( sim.data$event==1, sim.data$yt, -sim.data$yt )
data.full <- xgboost::xgb.DMatrix(data = sim.data$xs, label = Surv.xgb)
# for this example we use a small nfold and nrounds to shorten
# run time. This may still take a minute or so.
# xgbfit = xgb.tuned(data.full, objective="survival:cox", doxgb=list(nfold=5, nrounds=20))
# preds = predict(xgbfit, sim.data$xs)
# summary( preds )
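# One may also supply explicit cross-validation folds through the folds
# argument; a sketch, assuming the 1000 simulated rows above and again
# kept commented to limit run time:
# fold.id = sample(rep(1:5, length.out = 1000))           # assign rows to 5 folds
# myfolds = lapply(1:5, function(k) which(fold.id == k))  # index vectors per fold
# xgbfit2 = xgb.tuned(data.full, objective="survival:cox", folds=myfolds,
#                     doxgb=list(nrounds=20))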