run_gb {autoMrP}    R Documentation

Apply gradient boosting classifier to MrP.

Description

run_gb is a wrapper function that applies the gradient boosting classifier to data provided by the user, evaluates prediction performance, and chooses the best-performing model.

Usage

run_gb(
  y,
  L1.x,
  L2.x,
  L2.eval.unit,
  L2.unit,
  L2.reg,
  loss.unit,
  loss.fun,
  interaction.depth,
  shrinkage,
  n.trees.init,
  n.trees.increase,
  n.trees.max,
  cores = cores,
  n.minobsinnode,
  data,
  verbose
)

Arguments

y

Outcome variable. A character scalar containing the column name of the outcome variable in survey.

L1.x

Individual-level covariates. A character vector containing the column names of the individual-level variables in survey and census used to predict outcome y. Note that geographic unit is specified in argument L2.unit.

L2.x

Context-level covariates. A character vector containing the column names of the context-level variables in survey and census used to predict outcome y.

L2.eval.unit

Geographic unit for performance evaluation. A character scalar containing the column name of the geographic unit in survey and census at which prediction performance should be evaluated.

L2.unit

Geographic unit. A character scalar containing the column name of the geographic unit in survey and census at which outcomes should be aggregated.

L2.reg

Geographic region. A character scalar containing the column name of the geographic region in survey and census by which geographic units are grouped (L2.unit must be nested within L2.reg). Default is NULL.

loss.unit

Loss function unit. A character vector indicating whether performance loss should be evaluated at the level of individual respondents (individuals), geographic units (L2 units), or both. Default is c("individuals", "L2 units"). With multiple loss units, parameters are ranked for each loss unit and the loss unit with the lowest rank sum is chosen. Ties are broken according to the order in the search grid.

loss.fun

Loss function. A character vector indicating whether prediction loss should be measured by the mean squared error (MSE), the mean absolute error (MAE), binary cross-entropy (cross-entropy), mean squared false error (msfe), the F1 score (f1), or a combination thereof. Default is c("MSE", "cross-entropy", "msfe", "f1"). With multiple loss functions, parameters are ranked for each loss function and the parameter combination with the lowest rank sum is chosen. Ties are broken according to the order in the search grid.
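The rank-sum selection described above can be sketched as follows (the loss values below are hypothetical, and this is an illustration of the selection rule, not autoMrP's internal code):

```r
# Hypothetical loss values for three parameter combinations (rows)
# under two loss functions (columns); not actual autoMrP output.
losses <- data.frame(
  mse           = c(0.042, 0.039, 0.041),
  cross_entropy = c(0.310, 0.325, 0.298)
)

# Rank each parameter combination per loss function
# (lower loss = better, i.e., lower rank)
ranks <- apply(losses, 2, rank, ties.method = "first")

# Sum the ranks across loss functions
rank_sums <- rowSums(ranks)

# Pick the combination with the lowest rank sum; which.min()
# returns the first minimum, so ties are broken by the order
# of the combinations in the search grid.
best <- which.min(rank_sums)
```

Here the third combination wins: it is second-best on MSE and best on cross-entropy, giving it the lowest rank sum.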

interaction.depth

GB interaction depth. An integer-valued vector whose values specify the interaction depth of GB. The interaction depth defines the maximum depth of each tree grown (i.e., the maximum level of variable interactions). Default is c(1, 2, 3).

shrinkage

GB learning rate. A numeric vector whose values specify the learning rate or step-size reduction of GB. Values between 0.001 and 0.1 usually work, but a smaller learning rate typically requires more trees. Default is c(0.04, 0.01, 0.008, 0.005, 0.001).

n.trees.init

GB initial total number of trees. An integer-valued scalar specifying the initial number of total trees to fit by GB. Default is 50.

n.trees.increase

GB increase in total number of trees. An integer-valued scalar specifying by how many trees the total number of trees to fit should be increased (until n.trees.max is reached) or an integer-valued vector of length length(shrinkage) with each of its values being associated with a learning rate in shrinkage. Default is 50.

n.trees.max

GB maximum number of trees. An integer-valued scalar specifying the maximum number of trees to fit by GB or an integer-valued vector of length length(shrinkage) with each of its values being associated with a learning rate and an increase in the total number of trees. Default is 1000.
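Taken together, n.trees.init, n.trees.increase, and n.trees.max imply a sequence of candidate tree counts for each learning rate. Under the defaults, that sequence can be sketched as (an illustration of the implied grid, not autoMrP's internal code):

```r
n.trees.init     <- 50
n.trees.increase <- 50
n.trees.max      <- 1000

# Candidate tree counts tried for a given learning rate:
# 50, 100, 150, ..., 1000
tree_grid <- seq(from = n.trees.init, to = n.trees.max, by = n.trees.increase)

length(tree_grid)  # 20 candidate values
```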

cores

The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1.

n.minobsinnode

GB minimum number of observations in the terminal nodes. An integer-valued scalar specifying the minimum number of observations that each terminal node of the trees must contain. Default is 5.

data

Data for cross-validation. A list of k data.frames, one for each fold to be used in k-fold cross-validation.

verbose

Verbose output. A logical argument indicating whether or not verbose output should be printed. Default is TRUE.

Value

The tuned gradient boosting parameters. A list with three elements: interaction_depth contains the interaction depth parameter, shrinkage contains the learning rate, and n_trees contains the number of trees to be grown.
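A sketch of a call is given below. The object survey_folds and the column names "vote", "age", "education", "unemployment", "urbanization", "state", and "region" are placeholders for the user's own data, not objects shipped with autoMrP:

```r
library(autoMrP)

# survey_folds is assumed to be a list of k data.frames,
# one per cross-validation fold, prepared beforehand.
gb_out <- run_gb(
  y = "vote",
  L1.x = c("age", "education"),
  L2.x = c("unemployment", "urbanization"),
  L2.eval.unit = "state",
  L2.unit = "state",
  L2.reg = "region",
  loss.unit = c("individuals", "L2 units"),
  loss.fun = c("MSE", "cross-entropy", "msfe", "f1"),
  interaction.depth = c(1, 2, 3),
  shrinkage = c(0.04, 0.01, 0.008, 0.005, 0.001),
  n.trees.init = 50,
  n.trees.increase = 50,
  n.trees.max = 1000,
  cores = 1,
  n.minobsinnode = 5,
  data = survey_folds,
  verbose = TRUE
)

# The tuned parameters are then available as
# gb_out$interaction_depth, gb_out$shrinkage, gb_out$n_trees
```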


[Package autoMrP version 0.98 Index]