rfpi {RFpredInterval}R Documentation

Prediction intervals with random forests

Description

Constructs prediction intervals using the 15 variations proposed by Roy and Larocque (2020). Each variation combines two choices: the method used to build the forest and the method used to build the prediction interval. There are three methods to build the forest: (i) least-squares (LS), (ii) L1, and (iii) shortest prediction interval (SPI) splitting rules from the CART paradigm. There are five methods for constructing prediction intervals: the classical method, shortest prediction interval, quantile method, highest density region (HDR), and contiguous HDR.
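The count of 15 is the product of the three split rules and the five PI methods. As a minimal sketch in base R (the data frame name variations is only illustrative), the combinations can be enumerated from the option strings accepted by split_rule and pi_method:

```r
# Enumerate every combination of forest split rule and PI method.
variations <- expand.grid(
  split_rule = c("ls", "l1", "spi"),
  pi_method  = c("lm", "spi", "quant", "hdr", "chdr"),
  stringsAsFactors = FALSE
)

# Number of distinct variations (3 split rules x 5 PI methods).
nrow(variations)
```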

Usage

rfpi(
  formula,
  traindata,
  testdata,
  alpha = 0.05,
  split_rule = c("ls", "l1", "spi"),
  pi_method = c("lm", "spi", "quant", "hdr", "chdr"),
  calibration = TRUE,
  rf_package = c("rfsrc", "ranger"),
  params_rfsrc = list(ntree = 2000, mtry = ceiling(px/3), nodesize = 5,
    samptype = "swr"),
  params_ranger = list(num.trees = 2000, mtry = ceiling(px/3), min.node.size = 5,
    replace = TRUE),
  params_calib = list(range = c(1 - alpha - 0.005, 1 - alpha + 0.005),
    start = (1 - alpha), step = 0.01, refine = TRUE),
  oob = FALSE
)

Arguments

formula

Object of class formula or character describing the model to fit.

traindata

Training data of class data.frame.

testdata

Test data of class data.frame.

alpha

Miscoverage level of the prediction interval; (1 - alpha) is the desired coverage level. The default is alpha = 0.05 for the 95% prediction interval.

split_rule

Split rule for building a forest. Options are "ls" for CART with least-squares (LS) splitting rule, "l1" for CART with L1 splitting rule, "spi" for CART with shortest prediction interval (SPI) splitting rule. The default is "ls".

pi_method

Methods for building a prediction interval. Options are "lm" for the classical method, "spi" for shortest prediction interval, "quant" for the quantile method, "hdr" for highest density region, and "chdr" for contiguous HDR. The default is to use all methods for PI construction. A single method or a subset of methods can also be specified.

calibration

Apply out-of-bag (OOB) calibration to find the working level of alpha, i.e. \alpha_w. See below for details. The default is TRUE.

rf_package

Random forest package used for RF training. Options are "rfsrc" for the randomForestSRC package and "ranger" for the ranger package. Split rule "ls" can be used with both packages. However, the "l1" and "spi" split rules can only be used with "rfsrc". The default is "rfsrc".

params_rfsrc

List of parameters that should be passed to randomForestSRC. In the default parameter set, ntree = 2000, mtry = px/3 (rounded up), nodesize = 5, samptype = "swr". See randomForestSRC for possible parameters.

params_ranger

List of parameters that should be passed to ranger. In the default parameter set, num.trees = 2000, mtry = px/3 (rounded up), min.node.size = 5, replace = TRUE. See ranger for possible parameters.

params_calib

List of parameters for the calibration procedure. range is the allowed target calibration range for the coverage level; the value that provides a coverage level within this range is chosen as \alpha_w. start is the initial coverage level at which the calibration procedure starts. step is the coverage step size for each calibration iteration. refine controls whether the step value is gradually decreased when the coverage is close to the target level; the default is TRUE, which allows the gradual decrease.

oob

Should out-of-bag (OOB) predictions and prediction intervals for the training observations be returned? The default is FALSE.

Value

A list with the following components:

lm_interval

Prediction intervals for test data with the classical method. A list containing lower and upper bounds.

spi_interval

Prediction intervals for test data with SPI method. A list containing lower and upper bounds.

hdr_interval

Prediction intervals for test data with the HDR method. A list containing the lower and upper bounds of the prediction intervals for each test observation. There may be multiple PIs for a single observation.

chdr_interval

Prediction intervals for test data with contiguous HDR method. A list containing lower and upper bounds.

quant_interval

Prediction intervals for test data with quantiles method. A list containing lower and upper bounds.

test_pred

Random forest predictions for test data.

test_response

If available, test response.

alphaw

Working level of alpha, i.e. \alpha_w. A numeric array for the PI methods entered with pi_method. If calibration = FALSE, it returns NULL.

split_rule

Split rule used for building the random forest.

rf_package

Random forest package that was used for RF training.

oob_pred_interval

Out-of-bag (OOB) prediction intervals for train data. Prediction intervals are built with alpha. If oob = FALSE, it returns NULL.

oob_pred

Out-of-bag (OOB) predictions for train data. If oob = FALSE, it returns NULL.

train_response

Train response.

Details

Calibration process

The calibration procedure uses the "Bag of Observations for Prediction" (BOP) idea. The BOP for a new observation is built with the set of in-bag observations that are in the same terminal nodes as the new observation. The calibration procedure uses the BOPs constructed for the training observations. The BOP for a training observation is built using only the trees where this training observation is out-of-bag (OOB).
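The BOP construction can be illustrated with a small base R sketch. This is not the package's internal code. The helpers build_bop and build_oob_bop, the toy matrices node_train (terminal node IDs per tree) and inbag (in-bag counts per tree), and the choice to repeat a response by its in-bag count are all assumptions made for illustration.

```r
# Hypothetical helper: pool the in-bag training responses that share a
# terminal node with the new observation, over all trees.
build_bop <- function(y, node_train, inbag, node_new) {
  bop <- numeric(0)
  for (b in seq_len(ncol(node_train))) {
    same_node <- node_train[, b] == node_new[b]
    # Repeat each response by its in-bag count in tree b (0 drops OOB obs).
    bop <- c(bop, rep(y[same_node], times = inbag[same_node, b]))
  }
  bop
}

# Hypothetical helper: BOP for training observation i, restricted to the
# trees in which observation i is out-of-bag.
build_oob_bop <- function(i, y, node_train, inbag) {
  oob_trees <- which(inbag[i, ] == 0)
  build_bop(y,
            node_train[, oob_trees, drop = FALSE],
            inbag[, oob_trees, drop = FALSE],
            node_train[i, oob_trees])
}

# Toy forest: 4 training observations, 2 trees (made-up values).
y          <- c(10, 20, 30, 40)
node_train <- matrix(c(1, 1, 2, 2,
                       1, 2, 2, 1), ncol = 2)
inbag      <- matrix(c(1, 0, 2, 1,
                       0, 1, 1, 2), ncol = 2)

# New observation falls in node 1 of tree 1 and node 2 of tree 2.
bop_new  <- build_bop(y, node_train, inbag, node_new = c(1, 2))

# OOB-BOP for training observation 2 (out-of-bag only in tree 1).
bop_oob2 <- build_oob_bop(2, y, node_train, inbag)
```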

Let (1-\alpha) be the target coverage level. The goal of the calibration is to find the value of \alpha_w, which Roy and Larocque (2020) call the working level of \alpha, such that the coverage level of the prediction intervals for the training observations is closest to the target coverage level. The idea is to find the value of \alpha_w using the OOB-BOPs. Once found, (1-\alpha_w) becomes the level used to build the prediction intervals for the new observations.
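The search for \alpha_w can be sketched as a simple loop over candidate coverage levels. This is an illustration under stated assumptions, not the package's implementation: oob_coverage and calibrate are hypothetical helpers, the intervals are plain quantile intervals of each OOB-BOP, the OOB-BOPs are simulated, and halving the step after an overshoot is only one possible reading of the refine option.

```r
# Hypothetical helper: empirical coverage of quantile-type intervals
# built from a list of OOB-BOPs at a given coverage level.
oob_coverage <- function(bops, y, level) {
  a <- 1 - level
  covered <- mapply(function(bop, yi) {
    q <- quantile(bop, probs = c(a / 2, 1 - a / 2), names = FALSE)
    yi >= q[1] && yi <= q[2]
  }, bops, y)
  mean(covered)
}

# Hypothetical helper: move the working coverage level up or down by
# `step` until the OOB coverage falls inside `range`.
calibrate <- function(bops, y, range, start, step, max_iter = 100) {
  level <- start
  last_dir <- 0
  cov <- oob_coverage(bops, y, level)
  for (iter in seq_len(max_iter)) {
    if (cov >= range[1] && cov <= range[2]) break
    dir <- if (cov < range[1]) 1 else -1
    # Assumed refinement: halve the step when the search changes direction.
    if (dir == -last_dir) step <- step / 2
    level <- min(max(level + dir * step, 0.01), 0.999)
    cov <- oob_coverage(bops, y, level)
    last_dir <- dir
  }
  list(alpha_w = 1 - level, coverage = cov)
}

# Simulated OOB-BOPs that are too narrow for the responses (made-up data).
set.seed(1)
n    <- 200
bops <- lapply(seq_len(n), function(i) rnorm(50))
y    <- rnorm(n, sd = 1.3)

res <- calibrate(bops, y, range = c(0.895, 0.905), start = 0.9, step = 0.01)
res$alpha_w
res$coverage
```

Because the search is capped at max_iter iterations, the returned coverage is not guaranteed to lie inside range; a caller would want to check it.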

References

Roy, M. H., & Larocque, D. (2020). Prediction intervals with random forests. Statistical Methods in Medical Research, 29(1), 205-229. doi:10.1177/0962280219829885.

See Also

piall pibf print.rfpredinterval

Examples


## load example data
data(BostonHousing, package = "RFpredInterval")
set.seed(2345)

## define train/test split
testindex <- 1:10
trainindex <- sample(11:nrow(BostonHousing), size = 100, replace = FALSE)
traindata <- BostonHousing[trainindex, ]
testdata <- BostonHousing[testindex, ]
px <- ncol(BostonHousing) - 1

## construct 90% PI with "l1" split rule and "spi" PI method with calibration
out <- rfpi(formula = medv ~ ., traindata = traindata,
  testdata = testdata, alpha = 0.1, calibration = TRUE,
  split_rule = "l1", pi_method = "spi", params_rfsrc = list(ntree = 50),
  params_calib = list(range = c(0.89, 0.91), start = 0.9, step = 0.01,
  refine = TRUE))

## get the PI with "spi" method for first observation in the testdata
c(out$spi_interval$lower[1], out$spi_interval$upper[1])

## get the random forest predictions for testdata
out$test_pred

## get the working level of alpha (alphaw)
out$alphaw

## construct 95% PI with "ls" split rule, "lm" and "quant" PI methods
## with calibration and use "ranger" package for RF training
out2 <- rfpi(formula = medv ~ ., traindata = traindata,
  testdata = testdata, split_rule = "ls", pi_method = c("lm", "quant"),
  rf_package = "ranger", params_ranger = list(num.trees = 50))

## get the PI with "quant" method for the testdata
cbind(out2$quant_interval$lower, out2$quant_interval$upper)
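Once intervals are available, a natural follow-up is to check their empirical coverage and mean width on the test data. The sketch below uses a hypothetical helper, pi_summary, that is not part of the package. It is demonstrated on made-up numbers so that it runs on its own, and the commented call assumes the test response is present in out2$test_response.

```r
# Hypothetical helper: empirical coverage and mean width of a set of
# prediction intervals, given lower/upper bounds and observed responses.
pi_summary <- function(lower, upper, y) {
  c(coverage   = mean(y >= lower & y <= upper),
    mean_width = mean(upper - lower))
}

# Toy check with made-up numbers (not package output).
toy <- pi_summary(lower = c(1, 2, 3), upper = c(3, 4, 5), y = c(2, 5, 4))
toy

# With the objects from the example above, one could call:
# pi_summary(out2$quant_interval$lower, out2$quant_interval$upper,
#            out2$test_response)
```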



[Package RFpredInterval version 1.0.8 Index]