covregrf {CovRegRF}

R Documentation

Covariance Regression with Random Forests

Description

Estimates the covariance matrix of a multivariate response given a set of covariates using a random forest framework.

Usage

covregrf(
  formula,
  data,
  params.rfsrc = list(ntree = 1000, mtry = ceiling(px/3), nsplit = max(round(n/50),
    10)),
  nodesize.set = round(0.5^(1:100) * sampsize)[round(0.5^(1:100) * sampsize) > py],
  importance = FALSE
)

Arguments

`formula`	Object of class `formula` or `character` describing the model to fit. Interaction terms are not supported.
`data`	The multivariate data set, given as a `data.frame` with `n` observations and `px+py` variables, where `px` and `py` are the numbers of covariates (`X`) and response variables (`Y`), respectively.
`params.rfsrc`	List of parameters that should be passed to `randomForestSRC`. In the default parameter set, `ntree` = 1000, `mtry` = `px/3` (rounded up), `nsplit` = `max(round(n/50), 10)`. See `randomForestSRC` for possible parameters.
`nodesize.set`	The set of `nodesize` levels for tuning. The default set consists of the sub-sample size (`.632n`) multiplied by successive powers of one half, i.e. `round(0.5^k * sampsize)` for `k = 1, ..., 100`, keeping only the levels greater than the number of response variables (`py`). See Details for the `nodesize` tuning procedure.
`importance`	Should variable importance of covariates be assessed? The default is `FALSE`.
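
The default `nodesize.set` can be reproduced by hand to see which levels will be tried. The following is a minimal sketch in base R; the values of `n` and `py` are toy numbers, and rounding `.632n` to an integer to obtain `sampsize` is an assumption made for illustration rather than something this page states.

```r
# Toy sizes, chosen only for illustration.
n  <- 300   # number of training observations
py <- 3     # number of response variables

# Assumption: the sub-sample size is .632 * n, rounded to an integer.
sampsize <- round(0.632 * n)

# Same expression as the default shown in Usage: halve the sub-sample size
# repeatedly, round, and keep only the levels that exceed py.
candidates   <- round(0.5^(1:100) * sampsize)
nodesize.set <- candidates[candidates > py]

print(nodesize.set)
```

Because each level is roughly half the previous one, the default set is short (about `log2(sampsize / py)` levels), so the tuning step fits only a handful of forests.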

Value

An object of class `(covregrf, grow)`, which is a list with the following components:

`predicted.oob`	OOB predicted covariance matrices for training observations.
`importance`	Variable importance measures (VIMP) for covariates.
`best.nodesize`	Best `nodesize` value selected with the proposed tuning method.
`params.rfsrc`	List of parameters that were used to fit the random forest with `randomForestSRC`.
`n`	Sample size of the data (observations with `NA`s are omitted).
`xvar.names`	A character vector of the covariate names.
`yvar.names`	A character vector of the response variable names.
`xvar`	Data frame of covariates.
`yvar`	Data frame of responses.
`rf.grow`	Fitted random forest object. This object is used for prediction with training or new data.

Details

For mean regression problems, random forests search for the optimal level of the `nodesize` parameter by using out-of-bag (OOB) prediction errors, computed as the difference between the true responses and the OOB predictions. The `nodesize` value with the smallest OOB prediction error is chosen. The covariance regression problem, however, is unsupervised by nature: the true covariance matrices are not observed, so no such prediction error can be computed. We therefore tune the `nodesize` parameter with a heuristic method based on the OOB covariance matrix estimates. The general idea is to find the `nodesize` level at which the OOB covariance matrix predictions converge.

The steps are as follows. First, we train a separate random forest for each value in a set of `nodesize` values. Second, we compute the OOB covariance matrix estimates for each random forest. Next, we compute the mean absolute difference (MAD) between the upper triangular parts of the OOB covariance matrix estimates of two consecutive `nodesize` levels, over all observations. Finally, we take the pair of `nodesize` levels with the smallest MAD. Of these two levels, we select the smaller one, since deeper trees are generally desired in random forests.
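
The selection step of this heuristic can be sketched in base R. This is an illustrative sketch, not the package's internal code: `select_nodesize` is a hypothetical helper, the OOB estimates are toy matrices rather than forest output, and including the diagonal in the upper triangular part is an assumption.

```r
# Hypothetical helper, not part of CovRegRF.
# oob_by_level: list with one element per nodesize level; each element is a
#   list of py x py covariance matrices, one per training observation.
# nodesize_set: nodesize levels, in the same order as oob_by_level.
select_nodesize <- function(oob_by_level, nodesize_set) {
  stopifnot(length(oob_by_level) == length(nodesize_set),
            length(nodesize_set) >= 2)

  # Collect the upper triangular entries of every matrix at each level
  # (diagonal included, which is an assumption of this sketch).
  upper <- lapply(oob_by_level, function(mats) {
    unlist(lapply(mats, function(S) S[upper.tri(S, diag = TRUE)]))
  })

  # Mean absolute difference between consecutive nodesize levels,
  # averaged over all observations and matrix entries.
  mad <- vapply(seq_len(length(upper) - 1), function(k) {
    mean(abs(upper[[k]] - upper[[k + 1]]))
  }, numeric(1))

  # Take the pair with the smallest MAD and keep the smaller nodesize.
  k <- which.min(mad)
  list(mad = mad,
       best.nodesize = min(nodesize_set[k], nodesize_set[k + 1]))
}

# Toy OOB estimates: 4 observations, 2 responses, 4 nodesize levels.
make_level <- function(shift) {
  lapply(1:4, function(i) diag(2) * i + shift)
}
nodesize_set <- c(40, 20, 10, 5)
oob_by_level <- list(make_level(0), make_level(0.5),
                     make_level(0.6), make_level(1.5))

res <- select_nodesize(oob_by_level, nodesize_set)
print(res$mad)
print(res$best.nodesize)
```

In the toy data the estimates change least between the levels 20 and 10, so the sketch returns the smaller of the two. In practice the inputs would be the OOB estimates from forests fitted at each level of `nodesize.set`.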

Examples

options(rf.cores=2, mc.cores=2)

## load generated example data
data(data, package = "CovRegRF")
xvar.names <- colnames(data$X)
yvar.names <- colnames(data$Y)
data1 <- data.frame(data$X, data$Y)

## define train/test split
set.seed(2345)
smp <- sample(1:nrow(data1), size = round(nrow(data1)*0.6), replace = FALSE)
traindata <- data1[smp,,drop=FALSE]
testdata <- data1[-smp, xvar.names, drop=FALSE]

## formula object
formula <- as.formula(paste(paste(yvar.names, collapse="+"), ".", sep=" ~ "))

## train covregrf
covregrf.obj <- covregrf(formula, traindata, params.rfsrc = list(ntree = 50),
  importance = TRUE)

## get the OOB predictions
pred.oob <- covregrf.obj$predicted.oob

## predict with new test data
pred.obj <- predict(covregrf.obj, newdata = testdata)
pred <- pred.obj$predicted

## get the variable importance measures
vimp <- covregrf.obj$importance

[Package CovRegRF version 2.0.1 Index]