blockfor {blockForest} | R Documentation |
Random Forest variants for block-structured covariate data
Description
Implements five Random Forest variants for prediction
of binary, survival and metric outcomes using block-structured covariate
data, for example, clinical covariates plus measurements of a certain omics data type
or multi-omics data, that is, data for which measurements of different types of omics data
and/or clinical data for each patient exist. For example, for the task of predicting
survival for each patient there might be available
clinical covariates, gene expression measurements, mutation data,
and copy number variation measurements.
The group of covariates corresponding to one specific data type is denoted as a 'block'.
NOTE: We strongly recommend using the variant "BlockForest" (or "block forest")
in applications. The other four variants performed worse than "BlockForest"
in the analysis by Hornung & Wright (2019). Using 20 real multi-omics data sets Hornung & Wright (2019) compared all
five variants with each other and with alternatives, in particular with Random Survival Forest as existing
reference method. The ranking of the performances of the five variants was as follows
in the comparison study by Hornung & Wright (2019): 1) "BlockForest", 2) "RandomBlock",
3) "BlockVarSel", 4) "VarProb", 5) "SplitWeights".
Each of the five variants uses a different split selection algorithm.
For details, see Hornung & Wright (2019).
Note that this R package is a fork of the R package ranger.
NOTE ALSO: Including the clinical block mandatorily in the split point selection can considerably improve the prediction performance.
Whether or not this is the case, depends on the level of predictive information contained in the clinical block.
We recommend trying out including the clinical block mandatorily to see, whether this improves prediction
performance in the particular application. Note that in the case of including the clinical block mandatorily
and having more than only one omics block, "RandomBlock" performed (slightly) better than "BlockForest" in the comparison study by Hornung & Wright (2019).
Including the clinical block mandatorily can be performed by setting the function argument 'always.select.block'
of 'blockfor()' to the index of the clinical block (e.g., if the clinical block would be the second block in order, we would
set always.select.block=2).
Usage
blockfor(
X,
y,
blocks,
block.method = "BlockForest",
num.trees = 2000,
mtry = NULL,
nsets = 300,
num.trees.pre = 1500,
splitrule = "extratrees",
always.select.block = 0,
...
)
Arguments
X |
Covariate matrix. observations in rows, variables in columns. |
y |
Target variable. If the outcome is binary, this is a factor with two levels. If the outcome is metric, this is a numeric vector. If the outcome is a survival outcome, this is a matrix with two columns, where the first column contains the vector of survival/censoring times (one for each observation) and the second column contains the status variable, that has the value '1' if the corresponding time is a survival time and '0' if that time is a censoring time. |
blocks |
A list of length equal to the number M of blocks considered. Each entry contains the vector of column indices in 'X' of the covariates in one of the M blocks. |
block.method |
Forest variant to use. One of the following: "BlockForest" (default), "RandomBlock", "BlockVarSel", "VarProb", "SplitWeights". The latter listing is ordered according to the performances of these variants in the comparison study by Hornung & Wright (2019), with the best variant being listed first. |
num.trees |
Number of trees in the forest. |
mtry |
This is either a number specifying the number of variables sampled for each split from all variables (for variants "VarProb" and "SplitWeights") or a vector of length equal to the number of blocks, where the m-th entry of the vector gives the number of variables to sample from block m (for variants "BlockForest", "RandomBlock", and "BlockVarSel"). The default values are sqrt(p_1) + sqrt(p_2) + ... sqrt(p_M) and (sqrt(p_1), sqrt(p_2), ..., sqrt(p_M)), respectively, where p_m denotes the number of variables in the m-th block (m = 1, ..., M) and sqrt() denoted the square root function. |
nsets |
Number of sets of tuning parameter values generated randomly in the optimization of the tuning parameters. Each variant has a tuning parameter for each block, that is, there are M tuning parameters for each variant. These tuning parameters are optimized in the following way: 1. Generate random sets of tuning parameter values and measure there adequateness: For j = 1,..., nsets: a) Generate a random set of tuning parameter values; b) Construct a forest (with num.trees.pre trees) using the set of tuning parameter values generated in a); c) Record the out-of-bag (OOB) estimated prediction error of the forest constructed in b); 2. Use the set of tuning parameter values generated in 1. that is associated with the smallest OOB estimated prediction error. |
num.trees.pre |
Number of trees in each forest constructed during the optimization of the tuning parameter values, see 'nsets' for details. |
splitrule |
Splitting rule. Default "extratrees" (for computational efficiency). For other options see |
always.select.block |
Number of block to make always available for splitting (e.g. clinical covariates). |
... |
Parameters passed to |
Value
blockfor
returns a list containing the following components:
forest |
object of class |
paramvalues |
vector of length M. Optimized tuning parameter value for each block. |
biased_oob_error_donotuse |
numeric. OOB estimated prediction error. NOTE: This estimate should not be used, because it is (highly) optimistic (i.e, too small), because the data set was used twice - for optimizing the tuning parameter values and for estimating the prediction error. Instead, cross-validation should be used to estimate the prediction error. |
Author(s)
Roman Hornung, Marvin N. Wright
References
Hornung, R. & Wright, M. N. (2019) Block Forests: random forests for blocks of clinical and omics covariate data. BMC Bioinformatics 20:358. doi:10.1186/s12859-019-2942-y.
Breiman, L. (2001). Random forests. Mach Learn, 45(1), 5-32. doi:10.1023/A:1010933404324.
Wright, M. N. & Ziegler, A. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J Stat Softw 77:1-17. doi:10.18637/jss.v077.i01.
Examples
# NOTE: There is no association between covariates and response for the
# simulated data below.
# Moreover, the input parameters of blockfor() are highly unrealistic
# (e.g., nsets = 10 is specified much too small).
# The purpose of the shown examples is merely to illustrate the
# application of blockfor().
# Generate data:
################
set.seed(1234)
# Covariate matrix:
X <- cbind(matrix(nrow=40, ncol=5, data=rnorm(40*5)),
matrix(nrow=40, ncol=30, data=rnorm(40*30, mean=1, sd=2)),
matrix(nrow=40, ncol=100, data=rnorm(40*100, mean=2, sd=3)))
# Block variable (list):
blocks <- rep(1:3, times=c(5, 30, 100))
blocks <- lapply(1:3, function(x) which(blocks==x))
# Binary outcome:
ybin <- factor(sample(c(0,1), size=40, replace=TRUE), levels=c(0,1))
# Survival outcome:
ysurv <- cbind(rnorm(40), sample(c(0,1), size=40, replace=TRUE))
# Application to binary outcome:
################################
blockforobj <- blockfor(X, ybin, num.trees = 100, replace = TRUE, blocks=blocks,
nsets = 10, num.trees.pre = 50, splitrule="extratrees",
block.method = "BlockForest")
# Tuning parameter estimates (see Hornung & Wright (2019)):
blockforobj$paramvalues
# Application to survival outcome:
##################################
blockforobj <- blockfor(X, ysurv, num.trees = 100, replace = TRUE, blocks=blocks,
nsets = 10, num.trees.pre = 50, splitrule="extratrees",
block.method = "BlockForest")
blockforobj$paramvalues