R: (Adaptive) Boosting Trees (ABT/BT) Algorithm.

BT {BT}

R Documentation

(Adaptive) Boosting Trees (ABT/BT) Algorithm.

Description

Performs the (Adaptive) Boosting Trees algorithm. This function prepares the inputs and calls the function BT_call. Each tree in the process is built with the rpart function. When cross-validation is requested, this function also prepares the folds and performs multiple calls to the fitting function BT_call.

Usage

BT(
  formula = formula(data),
  data = list(),
  tweedie.power = 1,
  ABT = TRUE,
  n.iter = 100,
  train.fraction = 1,
  interaction.depth = 4,
  shrinkage = 1,
  bag.fraction = 1,
  colsample.bytree = NULL,
  keep.data = TRUE,
  is.verbose = FALSE,
  cv.folds = 1,
  folds.id = NULL,
  n.cores = 1,
  tree.control = rpart.control(
    xval = 0,
    maxdepth = (if (!is.null(interaction.depth)) { interaction.depth } else { 10 }),
    cp = -Inf,
    minsplit = 2
  ),
  weights = NULL,
  seed = NULL,
  ...
)
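
The default `tree.control` expression above ties `maxdepth` to `interaction.depth`. The following minimal sketch reproduces that expression with `rpart.control` to show what it evaluates to. The helper `default_tree_control` is hypothetical and only mirrors the default shown in Usage; it is not part of the `BT` package.

```r
library(rpart)

# Hypothetical helper mirroring the default `tree.control` expression from Usage.
default_tree_control <- function(interaction.depth = 4) {
  rpart.control(
    xval = 0,                                   # no internal rpart cross-validation
    maxdepth = if (!is.null(interaction.depth)) interaction.depth else 10,
    cp = -Inf,                                  # no complexity-based stopping
    minsplit = 2                                # allow splits on very small nodes
  )
}

ctrl_default <- default_tree_control()      # maxdepth follows interaction.depth
ctrl_null <- default_tree_control(NULL)     # maxdepth falls back to 10

ctrl_default$maxdepth
ctrl_null$maxdepth
```

Passing a custom `rpart.control(...)` object as `tree.control` together with `interaction.depth = NULL` is the route described under `interaction.depth` for advanced users.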

Arguments

`formula`	a symbolic description of the model to be fit. Note that offsets are not supported by this algorithm. Instead, everything is performed with a log-link function, and a direct relationship exists between response, offset and weights.
`data`	an optional data frame containing the variables in the model. By default the variables are taken from `environment(formula)`, typically the environment from which `BT` is called. If `keep.data=TRUE` in the initial call to `BT` then `BT` stores a copy with the object (up to the variables used).
`tweedie.power`	experimental parameter, currently not used. It is set to 1, which corresponds to the Poisson distribution.
`ABT`	a boolean parameter. If `ABT=TRUE` an adaptive boosting tree algorithm is built whereas if `ABT=FALSE` a usual boosting tree algorithm is run. By default, it is set to `TRUE`.
`n.iter`	the total number of iterations to fit. This is equivalent to the number of trees and the number of basis functions in the additive expansion. Please note that the initialization is not taken into account in the `n.iter`. More explicitly, a weighted average initializes the algorithm and then `n.iter` trees are built. Moreover, note that the `bag.fraction`, `colsample.bytree`, ... are not used for this initializing phase. By default, it is set to 100.
`train.fraction`	the first `train.fraction * nrow(data)` observations are used to fit the `BT` and the remainder are used for computing out-of-sample estimates (also known as validation error) of the loss function. By default, it is set to 1 meaning no out-of-sample estimates.
`interaction.depth`	the maximum depth of variable interactions: 1 builds an additive model, 2 builds a model with up to two-way interactions, etc. This parameter can also be interpreted as the maximum number of non-terminal nodes. By default, it is set to 4. Please note that if this parameter is `NULL`, all the trees in the expansion are built based on the `tree.control` parameter only, independently of the `ABT` value. This option is devoted to advanced users only and allows them to benefit from the full flexibility of the implemented algorithm.
`shrinkage`	a shrinkage parameter (in the interval (0,1]) applied to each tree in the expansion. Also known as the learning rate or step-size reduction. By default, it is set to 1.
`bag.fraction`	the fraction of independent training observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If `bag.fraction`<1 then running the same model twice will result in similar but different fits. Please note that if this parameter is used the `BTErrors$training.error` corresponds to the normalized in-bag error and the out-of-bag improvements are computed and stored in `BTErrors$oob.improvement`. See `BTFit` for more details. By default, it is set to 1.
`colsample.bytree`	each tree will be trained on a random subset of `colsample.bytree` number of features. Each tree will consider a new random subset of features from the formula, adding variability to the algorithm and reducing computation time. `colsample.bytree` will be bounded between 1 and the number of features considered in the formula. By default, it is set to `NULL` meaning no effect.
`keep.data`	a boolean variable indicating whether to keep the data frames. This is particularly useful if one wants to keep track of the initial data frames and is further used for predicting in case no data frame is specified. Note that in case of cross-validation, if `keep.data=TRUE` the initial data frames are saved whereas the cross-validation samples are not. By default, it is set to `TRUE`.
`is.verbose`	if `is.verbose=TRUE`, the `BT` will print out the algorithm progress. By default, it is set to `FALSE`.
`cv.folds`	a positive integer representing the number of cross-validation folds to perform. If `cv.folds`>1 then `BT`, in addition to the usual fit, will perform a cross-validation and calculate an estimate of generalization error returned in `BTErrors$cv.error`. By default, it is set to 1 meaning no cross-validation.
`folds.id`	an optional vector of values identifying what fold each observation is in. If supplied, this parameter prevails over `cv.folds`. By default, `folds.id = NULL` meaning that no folds are defined.
`n.cores`	the number of cores to use for parallelization. This parameter is used during the cross-validation. This parameter is bounded between 1 and the maximum number of available cores. By default, it is set to 1 leading to a sequential approach.
`tree.control`	for advanced users only. It allows defining additional tree parameters that will be used at each iteration. See `rpart.control` for more information.
`weights`	optional vector of weights used in the fitting process. These weights must be positive but do not need to be normalized. By default, it is set to `NULL` which corresponds to a uniform weight of 1 for each observation.
`seed`	optional number used as seed. Please note that if `cv.folds`>1, the `parLapply` function is called. Therefore, the seed (if defined) used inside each fold will be a multiple of the `seed` parameter.
`...`	not currently used.
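
The relationship between response, offset and weights mentioned under `formula` can be illustrated outside of `BT` with a plain Poisson GLM. The sketch below uses simulated data and `stats::glm` only; the names `expo`, `x` and `y` are made up for the illustration. It does not show how `BT` is implemented. It merely checks that, under a log link, modelling the count with an offset and modelling the rate with exposure weights give the same coefficients.

```r
set.seed(1)
n <- 500
expo <- runif(n, 0.1, 1)                          # hypothetical exposure
x <- rnorm(n)
y <- rpois(n, lambda = expo * exp(-1 + 0.5 * x))  # simulated claim counts

ctrl <- glm.control(epsilon = 1e-10, maxit = 100)

# Count response with log(exposure) as offset.
fit_offset <- glm(y ~ x + offset(log(expo)),
                  family = poisson(link = "log"), control = ctrl)

# Rate response weighted by exposure; quasipoisson yields the same point
# estimates as poisson and avoids warnings about non-integer responses.
y_rate <- y / expo
fit_weights <- glm(y_rate ~ x, weights = expo,
                   family = quasipoisson(link = "log"), control = ctrl)

# Both parameterizations solve the same score equations.
all.equal(coef(fit_offset), coef(fit_weights), tolerance = 1e-6)
```

This is why the example further down passes a normalized response together with `weights = ExpoR` instead of an offset term.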

Details

The NA values are currently dropped using na.omit.
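
As a quick way to anticipate how many observations this removes, `na.omit` can be applied to a data frame beforehand. The sketch below uses a small made-up data frame (`df` and its columns are hypothetical). Whether `BT` applies `na.omit` to the whole data frame or only to the variables in the formula is not stated here, so treat the count as indicative.

```r
# Toy data frame with missing values in different columns.
df <- data.frame(
  y      = c(1, 0, 2, NA, 1),
  age    = c(30, NA, 45, 50, 28),
  gender = factor(c("F", "M", "M", "F", NA))
)

complete <- na.omit(df)                       # keeps rows without any NA
n_dropped <- nrow(df) - nrow(complete)        # number of rows removed
dropped_rows <- as.integer(attr(complete, "na.action"))  # indices of removed rows

n_dropped
dropped_rows
```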

Value

a BTFit object.

Author(s)

Gireg Willame gireg.willame@gmail.com

This package is inspired by the gbm3 package. For more details, see https://github.com/gbm-developers/gbm3/.

References

M. Denuit, D. Hainaut and J. Trufin (2019). Effective Statistical Learning Methods for Actuaries I: GLMs and Extensions, Springer Actuarial.

M. Denuit, D. Hainaut and J. Trufin (2019). Effective Statistical Learning Methods for Actuaries II: Tree-Based Methods and Extensions, Springer Actuarial.

M. Denuit, D. Hainaut and J. Trufin (2019). Effective Statistical Learning Methods for Actuaries III: Neural Networks and Extensions, Springer Actuarial.

M. Denuit, D. Hainaut and J. Trufin (2022). Response versus gradient boosting trees, GLMs and neural networks under Tweedie loss and log-link. Accepted for publication in Scandinavian Actuarial Journal.

M. Denuit, J. Huyghe and J. Trufin (2022). Boosting cost-complexity pruned trees on Tweedie responses: The ABT machine for insurance ratemaking. Paper submitted for publication.

M. Denuit, J. Trufin and T. Verdebout (2022). Boosting on the responses with Tweedie loss functions. Paper submitted for publication.

Examples


## Load dataset.
dataset <- BT::BT_Simulated_Data

## Fit a Boosting Tree model.
BT_algo <- BT(formula = Y_normalized ~ Age + Sport + Split + Gender, # formula
              data = dataset, # data
              ABT = FALSE, # Classical Boosting Tree
              n.iter = 200,
              train.fraction = 0.8,
              interaction.depth = 3,
              shrinkage = 0.01,
              bag.fraction = 0.5,
              colsample.bytree = 2, # 2 explanatory variables used at each iteration.
              keep.data = FALSE, # Do not keep a data copy.
              is.verbose = FALSE, # Do not print progress.
              cv.folds = 3, # 3-cv will be performed.
              folds.id = NULL ,
              n.cores = 1,
              weights = ExpoR, # <=> Poisson model on response Y with ExpoR in offset.
              seed = NULL)

## Determine the model performance and plot results.
best_iter_val <- BT_perf(BT_algo, method='validation')
best_iter_oob <- BT_perf(BT_algo, method='OOB', oobag.curve = TRUE)
best_iter_cv <- BT_perf(BT_algo, method ='cv', oobag.curve = TRUE)

best_iter <- best_iter_val

## Variable influence and plot results.
# Based on the first iteration.
variable_influence1 <- summary(BT_algo, n.iter = 1)
# Using all iterations up to best_iter.
variable_influence_best_iter <- summary(BT_algo, n.iter = best_iter)

##  Print results : call, best_iters and summarized relative influence.
print(BT_algo)

## Model predictions.
# Predict on the link scale, using only the best_iter tree.
pred_single_iter <- predict(BT_algo, newdata = dataset,
                            n.iter = best_iter, type = 'link', single.iter = TRUE)
# Predict on the response scale, using the first best_iter trees.
pred_best_iter <- predict(BT_algo, newdata = dataset,
                          n.iter = best_iter, type = 'response')
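
The call above leaves `folds.id = NULL` and relies on `train.fraction` taking the first rows for training. The following base-R sketch, on made-up data, shows one way to build a balanced fold vector that could be supplied as `folds.id`, and how to shuffle rows before fitting when a data set is sorted. The names `n`, `k`, `folds_id` and `toy` are hypothetical. Whether `BT` expects `folds.id` to cover all rows or only the training rows when `train.fraction < 1` is not stated here.

```r
set.seed(123)
n <- 1000      # hypothetical number of observations
k <- 3         # number of folds

# Balanced fold labels in random order: one label per observation.
folds_id <- sample(rep(seq_len(k), length.out = n))
table(folds_id)

# `train.fraction` uses the first rows for training, so shuffling a sorted
# data set beforehand avoids a systematically different validation tail.
toy <- data.frame(id = seq_len(n), y = rpois(n, lambda = 1))
toy_shuffled <- toy[sample(nrow(toy)), ]
head(toy_shuffled)
```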

[Package BT version 0.4 Index]