boosting_diversity {Bodi}    R Documentation

Diversity Boosting Algorithm

Description

Train a set of base learners while promoting diversity among them. To this end, a gradient-descent strategy is adopted in which a specialized loss function induces diversity, which in turn yields a reduction of the mean squared error of the aggregated learner.
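
For intuition, here is a minimal sketch of a diversity-penalized squared loss of this general shape, with kappa playing the role of diversity_weight; it is an illustration only, not necessarily the exact loss of the package (see the accompanying paper for the precise definition).

diversity_loss <- function(y, f_new, f_ensemble, kappa = 1) {
  ## squared error of the new learner, minus a reward for disagreeing
  ## with the current ensemble prediction (this is what induces diversity)
  (y - f_new)^2 - kappa * (f_new - f_ensemble)^2
}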

Usage

boosting_diversity(
  target,
  cov,
  data0,
  data1,
  sample_size = 0.5,
  grad_step = 1,
  diversity_weight = 1,
  Nstep = 10,
  model = "gam",
  sampling = "random",
  Nblock = 10,
  aggregation_type = "uniform",
  param = list(),
  theorical_dw = FALSE,
  model_list = NULL,
  w_list = NULL,
  param_list = NULL,
  cov_list = NULL
)

Arguments

target

name of the target variable

cov

the model equation, a character string provided in formula syntax. For example, for a linear model including covariates X1 and X2 it is "X1+X2", and for a GAM with smooth effects it is "s(X1)+s(X2)".
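
For instance, two valid cov strings for the covariates used in the Examples section below:

cov_lin <- "Solar.R+Wind+Temp"           # linear effects
cov_gam <- "s(Solar.R)+s(Wind)+s(Temp)"  # smooth (GAM) effects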

data0

the learning set

data1

the test set

sample_size

the size of the bootstrap sample as a proportion of the learning set size. sample_size=0.5 means that the resamples are of size n/2 where n is the number of rows of data0.

grad_step

step size of the gradient descent

diversity_weight

the weight of the diversity-encouraging penalty (kappa in the paper)

Nstep

the number of iterations of the diversity boosting algorithm (N in the paper)

model

the type of base learner used in the algorithm when a single base learner is used (model_list=NULL). Currently it can be "gam" for an additive model, "rf" for a random forest, "gbm" for gradient boosting machines, or "rpart" for a single CART tree.

sampling

the type of sampling procedure used in the resampling step. Could be either "random" for uniform random sampling with replacement or "blocks" for uniform sampling with replacement of blocks of consecutive data points. Default is "random".

Nblock

number of blocks for the block sampling. Equal to 10 by default. A sketch of the block-resampling step is shown below.
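
A minimal sketch of what the block sampling amounts to, assuming Nblock contiguous blocks of row indices drawn uniformly with replacement (illustrative, not the package's internal code):

block_resample <- function(n, Nblock = 10) {
  ## split row indices 1..n into Nblock consecutive blocks,
  ## then draw Nblock blocks uniformly with replacement
  blocks <- split(seq_len(n), cut(seq_len(n), Nblock, labels = FALSE))
  unlist(blocks[sample(Nblock, Nblock, replace = TRUE)], use.names = FALSE)
}
idx <- block_resample(nrow(na.omit(airquality)))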

aggregation_type

type of aggregation used for the ensemble method. Default is "uniform" for uniform weights, but it can also be "MLpol", an aggregation algorithm from the opera package.
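
For reference, this is how MLpol aggregates expert forecasts in the opera package on toy data; it shows the aggregation rule itself, independently of how Bodi calls it:

library(opera)
set.seed(1)
y <- rnorm(100)
experts <- cbind(e1 = y + rnorm(100, sd = 0.5),
                 e2 = y + rnorm(100, sd = 1))
agg <- mixture(Y = y, experts = experts, model = "MLpol", loss.type = "square")
summary(agg)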

param

a list containing the parameters of the chosen model, e.g. the number of trees for "rf" or the depth of the tree for "rpart".
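
For instance, hypothetical param lists (the exact parameter names depend on the fitting packages used internally and are assumptions here):

param_rf    <- list(ntree = 300)   # hypothetical: number of trees for model = "rf"
param_rpart <- list(maxdepth = 4)  # hypothetical: tree depth for model = "rpart"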

theorical_dw

set to TRUE to use the theoretical upper bound of the diversity weight kappa

model_list

a list of models among the possible ones (see the description of the model argument). In that case, the weak learner is sampled at each step from this list. Still experimental, use with care.

w_list

the prior weights of each model in the model_list

param_list

list of parameters of each model in the model_list

cov_list

list of covariates of each model in the model_list
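
A hypothetical call combining these four lists, reusing dat and smp from the Examples section below (the feature is experimental and the argument contents shown are assumptions for illustration):

fit_mix <- boosting_diversity("Ozone", "Solar.R+Wind+Temp",
  data0 = dat[smp, ], data1 = dat[-smp, ],
  model_list = list("gam", "rf"),
  w_list     = c(0.5, 0.5),
  param_list = list(list(), list(ntree = 300)),  # hypothetical parameter name
  cov_list   = list("s(Solar.R)+s(Wind)+s(Temp)", "Solar.R+Wind+Temp"))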

Value

a list including the boosted models and the ensemble forecasts, with the following components:

fitted_ensemble

Fitted values (in-sample predictions) for the ensemble method (matrix).

forecast_ensemble

Forecasts (out-of-sample predictions) for the ensemble method (matrix).

fitted

Fitted values of the last boosting iteration (vector).

forecast

Forecast of the last boosting iteration (vector).

err_oob

Estimated out-of-bag errors by iteration (vector).

diversity_oob

Estimated out-of-bag diversity by iteration (vector).

Author(s)

Yannig Goude <yannig.goude@edf.fr>

Examples

## split the complete airquality records into a learning set and a test set
dat <- na.omit(airquality)
set.seed(1)  # for a reproducible split
smp <- sample(nrow(dat), floor(0.8 * nrow(dat)))
fit <- boosting_diversity("Ozone", "Solar.R+Wind+Temp+Month+Day",
                          data0 = dat[smp, ], data1 = dat[-smp, ])
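
## Illustrative follow-up (assumes the component names listed in the
## Value section above):
plot(fit$err_oob, type = "b", xlab = "iteration", ylab = "OOB error")
sqrt(mean((dat$Ozone[-smp] - fit$forecast)^2))  # test-set RMSE of the final forecast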
