interactionfor {diversityForest}    R Documentation
Construct an interaction forest prediction rule and calculate EIM values as described in Hornung & Boulesteix (2022).
Description
Implements interaction forests as described in Hornung & Boulesteix (2022). Currently, categorical, metric, and survival outcomes are supported. Interaction forests feature the effect importance measure (EIM), which can be used to rank the covariate variable pairs with respect to the impact of their interaction effects on prediction. This allows the identification of relevant interaction effects. Interaction forests focus on easily interpretable interaction effects. See the 'Details' section below for more information. In addition, we strongly recommend consulting Section C of Supplementary Material 1 of Hornung & Boulesteix (2022), which uses detailed examples of interaction forest analyses, including code, to illustrate how interaction forests can be used in applications.
Usage
interactionfor(
formula = NULL,
data = NULL,
importance = "both",
num.trees = NULL,
simplify.large.n = TRUE,
num.trees.eim.large.n = NULL,
write.forest = TRUE,
probability = FALSE,
min.node.size = NULL,
max.depth = NULL,
replace = FALSE,
sample.fraction = ifelse(replace, 1, 0.7),
case.weights = NULL,
class.weights = NULL,
splitrule = NULL,
always.split.variables = NULL,
keep.inbag = FALSE,
inbag = NULL,
holdout = FALSE,
quantreg = FALSE,
oob.error = TRUE,
num.threads = NULL,
verbose = TRUE,
seed = NULL,
dependent.variable.name = NULL,
status.variable.name = NULL,
npairs = NULL,
classification = NULL
)
Arguments
formula
Object of class formula or character describing the model to fit.
data
Training data of class data.frame or matrix.
importance
Effect importance mode. One of the following: "both" (the default), "qualitative", "quantitative", "mainonly", "none". See the 'Details' section below for explanation.
num.trees
Number of trees. The default number is 20000 if EIM values should be computed and 2000 otherwise. Note that if simplify.large.n = TRUE (the default) and the data set contains more than 1000 observations, the number of trees used for EIM value calculation is governed by num.trees.eim.large.n (see the 'Details' section below).
simplify.large.n
Should restricted tree depths be used when calculating EIM values for large data sets? See the 'Details' section below for more information. Default is TRUE.
num.trees.eim.large.n
Number of trees in the forest used for calculating the EIM values for large data sets.
This forest is only constructed if simplify.large.n = TRUE and the data set contains more than 1000 observations; see the 'Details' section below for more information.
write.forest
Save the forest in the output object, required for prediction. Set to FALSE to reduce memory usage if no prediction is intended.
probability
Grow a probability forest as in Malley et al. (2012).
min.node.size
Minimal node size. Default 1 for classification, 5 for regression, 3 for survival, and 5 for probability.
max.depth
Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree).
replace
Sample with replacement. Default is FALSE.
sample.fraction
Fraction of observations to sample. Default is 1 for sampling with replacement and 0.7 for sampling without replacement. For classification, this can be a vector of class-specific values.
case.weights
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
class.weights
Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes.
splitrule
Splitting rule. For classification and probability estimation "gini" or "extratrees" with default "gini". For regression "variance", "extratrees" or "maxstat" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank". NOTE: For interaction forests currently only the default splitting rules are supported.
always.split.variables
Currently not usable. Character vector with variable names to be always selected.
keep.inbag
Save how often observations are in-bag in each tree.
inbag
Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling.
holdout
Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error. NOTE: Currently not usable for interaction forests.
quantreg
Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set keep.inbag = TRUE to prepare out-of-bag quantile prediction.
oob.error
Compute OOB prediction error. Set to FALSE to save computation time, e.g., for large survival forests.
num.threads
Number of threads. Default is number of CPUs available.
verbose
Show computation status and estimated runtime.
seed
Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed.
dependent.variable.name
Name of outcome variable, needed if no formula given. For survival forests this is the time variable.
status.variable.name
Name of status variable, only applicable to survival data and needed if no formula given. Use 1 for event and 0 for censoring.
npairs
Number of variable pairs to sample for each split. Default is the square root of the number of independent variables divided by 2, rounded up (see the sketch after this argument list).
classification
Only needed if data is a matrix. Set to TRUE to grow a classification forest.
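The default value of npairs described above can be written as a one-line R expression. A minimal sketch (p stands for the number of independent variables; the exact placement of the rounding step is an assumption based on the wording above):
p <- 30                                 # number of independent variables (example value)
npairs.default <- ceiling(sqrt(p) / 2)  # assumed reading: square root of p, divided by 2, rounded up
npairs.default                          # 3 for p = 30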
Details
The effect importance measure (EIM) of interaction forests distinguishes quantitative and qualitative interaction effects (Peto, 1982).
This is a common distinction as these two types of interaction effects are interpreted in different ways (see below).
For both of these types, EIM values for each variable pair are obtained: the quantitative and qualitative EIM values.
Interaction forests target easily interpretable types of interaction effects. These can be communicated clearly using statements
of the following kind: "The strength of the positive (negative) effect of variable A on the outcome depends on the level of variable B"
for quantitative interactions, and "for observations with small values of variable B, the effect of variable A is positive (negative),
but for observations with large values of B, the effect of A is negative (positive)" for qualitative interactions.
In addition to calculating EIM values for variable pairs, importance values for the individual variables are calculated as well: the univariable
EIM values. These measure variable importance in the same way as classical variable importance measures of random forests.
The effect importance mode can be set via the importance argument:
"qualitative": calculate only qualitative EIM values;
"quantitative": calculate only quantitative EIM values;
"both" (the default): calculate qualitative and quantitative EIM values;
"mainonly": calculate only univariable EIM values.
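A minimal sketch of how the different effect importance modes might be requested in practice (the 'zoo' data set from the Examples section is used here, and num.trees is kept very small only to keep runtime short):
library("diversityForest")
data(zoo)
# Qualitative EIM values only:
mod.qual <- interactionfor(dependent.variable.name = "type", data = zoo,
                           importance = "qualitative", num.trees = 20)
# Univariable EIM values only:
mod.univ <- interactionfor(dependent.variable.name = "type", data = zoo,
                           importance = "mainonly", num.trees = 20)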
The top variable pairs with largest quantitative and qualitative EIM values likely have quantitative and qualitative interactions,
respectively, which have a considerable impact on prediction. The top variables with largest univariable EIM values likely have a considerable
impact on prediction. Note that it is currently not possible to test the EIM values for
statistical significance using the interaction forests algorithm itself. However, the p-values
shown in the plots obtained with plotEffects
(which are obtained using bivariable
models) can be adjusted for multiple testing using the Bonferroni procedure to obtain
practical p-values. See the end of the 'Details' section of plotEffects
for explanation and guidance.
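To illustrate the Bonferroni adjustment step itself, here is a minimal sketch in base R, where pvals is a hypothetical vector of p-values read off the plotEffects plots (the numbers below are made up for illustration):
# Hypothetical p-values taken from the plotEffects plots (made-up values):
pvals <- c(0.0004, 0.012, 0.031, 0.20, 0.47)
# Bonferroni adjustment for multiple testing:
p.adjust(pvals, method = "bonferroni")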
If the number of variables is larger than 100, not all possible variable pairs are considered, but, using a screening procedure, the
5000 variable pairs with the strongest indications of interaction effects are pre-selected.
NOTE: To make interpretations, it is crucial to investigate (visually) the forms that the interaction effects of variable pairs
with large quantitative and qualitative EIM values take. This can be done using the plot functions plot.interactionfor
(first overview) and plotEffects.
NOTE ALSO: As described in Hornung & Boulesteix (2022), in the case of data with larger numbers of variables (larger than 100,
but more seriously for high-dimensional data), the univariable EIM values can be biased. Therefore, it is strongly recommended
to interpret the univariable EIM values with caution if the data are high-dimensional. If it is of interest to measure the univariable
importance of the variables for high-dimensional data, an additional conventional random forest (e.g., using the ranger package)
should be constructed, and the variable importance measure values of this random forest should be used for ranking the univariable effects.
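A minimal sketch of how such an additional conventional random forest could be constructed with the ranger package; permutation importance is used here as one possible choice, and the small 'zoo' data set merely stands in for a real (high-dimensional) data set:
library("ranger")
data(zoo, package = "diversityForest")
# Conventional random forest with permutation variable importance:
rfmod <- ranger(dependent.variable.name = "type", data = zoo,
                importance = "permutation", num.trees = 500)
# Rank the variables by their univariable importance:
sort(importance(rfmod), decreasing = TRUE)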
For large data sets with many observations, the calculation of the EIM values can become very costly when using fully grown trees.
Therefore, the following maximum tree depths are used by default when calculating EIM values (argument: simplify.large.n = TRUE):
for data sets with at most 1000 observations, fully grown trees are used; for data sets with more than 1000 observations, the maximum
tree depth is restricted increasingly strongly with growing data set size, from depth 10 over depth 7 down to depth 5 for the largest data sets.
Extensive analyses in Hornung & Boulesteix (2022) suggest that by restricting the tree depth in this way,
the EIM values that would result when using fully grown trees are approximated well. However, prediction
performance suffers when using restricted trees. Therefore, the tree depth is restricted only when calculating
the EIM values (if simplify.large.n = TRUE), while a second interaction forest with unrestricted tree depth is constructed,
which is then used for prediction purposes.
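If this default behaviour is not desired, it can be influenced via the arguments described above. A minimal sketch, where largedat and the outcome name "y" are hypothetical placeholders for a data set with more than 1000 observations:
# 'largedat' and the outcome "y" are placeholders for a real, large data set.
# Keep the simplification, but use fewer trees in the forest that is
# grown for EIM value calculation:
mod1 <- interactionfor(dependent.variable.name = "y", data = largedat,
                       num.trees.eim.large.n = 5000)
# Switch the simplification off entirely and use fully grown trees
# (can be very slow for large data sets):
mod2 <- interactionfor(dependent.variable.name = "y", data = largedat,
                       simplify.large.n = FALSE)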
Value
Object of class interactionfor
with elements
predictions
Predicted classes/values, based on out-of-bag samples (classification and regression only).
num.trees
Number of trees.
num.independent.variables
Number of independent variables.
unique.death.times
Unique death times (survival only).
min.node.size
Value of minimal node size used.
npairs
Number of variable pairs sampled for each split.
eim.univ.sorted
Univariable EIM values sorted in decreasing order.
eim.univ
Univariable EIM values.
eim.qual.sorted
Qualitative EIM values sorted in decreasing order.
eim.qual
Qualitative EIM values.
eim.quant.sorted
Quantitative EIM values sorted in decreasing order.
eim.quant
Quantitative EIM values. These values are labeled analogously to those in eim.quant.sorted.
prediction.error
Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples, for probability estimation the Brier score, for regression the mean squared error, and for survival one minus Harrell's C-index. This is 'NA' for data sets with more than 100 covariate variables, because for such data sets we pre-select the 5000 variable pairs with strongest indications of interaction effects. This pre-selection cannot be taken into account in the out-of-bag error estimation, which is why the out-of-bag error estimates would be (much) too optimistic for data sets with more than 100 covariate variables.
forest
Saved forest (if write.forest set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column number in R.
confusion.matrix
Contingency table for classes and predictions based on out-of-bag samples (classification only).
chf
Estimated cumulative hazard function for each sample (survival only).
survival
Estimated survival function for each sample (survival only).
splitrule
Splitting rule.
treetype
Type of forest/tree: classification, regression, or survival.
r.squared
R squared. Also called explained variance or coefficient of determination (regression only). Computed on out-of-bag data.
call
Function call.
importance.mode
Importance mode used.
num.samples
Number of samples.
replace
Sample with replacement.
eim.quant.rawlists
List containing the four vectors of un-adjusted 'raw' quantitative EIM values
and the four vectors of adjusted EIM values. These are usually not required by the user.
promispairs
List giving the indices of the variables in the pre-selected variable pairs. If the number of variables is at most 100, all variable pairs are considered.
plotres
List of objects needed by the plot functions plot.interactionfor and plotEffects.
Author(s)
Roman Hornung, Marvin N. Wright
References
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Peto, R. (1982). Statistical aspects of cancer trials. In: K. E. Halnam (Ed.), Treatment of Cancer. Chapman & Hall: London.
Wright, M. N., Ziegler, A. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Breiman, L. (2001). Random forests. Machine Learning 45:5-32, <doi:10.1023/A:1010933404324>.
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., Ziegler, A. (2012). Probability machines: Consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, <doi:10.3414/ME00-01-0052>.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research 7:983-999.
See Also
predict.divfor, plot.interactionfor, plotEffects
Examples
## Not run:
## Load package:
library("diversityForest")
## Set seed to make results reproducible:
set.seed(1234)
## Construct interaction forests and calculate EIM values:
# Binary outcome:
data(zoo)
modelcat <- interactionfor(dependent.variable.name = "type", data = zoo,
num.trees = 20)
# Metric outcome:
data(stock)
modelcont <- interactionfor(dependent.variable.name = "company10", data = stock,
num.trees = 20)
# Survival outcome:
library("survival")
mgus2$id <- NULL # 'mgus2' data set is contained in the 'survival' package
# categorical variables need to be of factor format - important!!
mgus2$sex <- factor(mgus2$sex)
mgus2$pstat <- factor(mgus2$pstat)
# Remove the second time variable 'ptime':
mgus2$ptime <- NULL
# Remove missing values:
mgus2 <- mgus2[complete.cases(mgus2),]
# Take subset to make the calculations less computationally
# expensive for the example (in actual applications, we would of course
# use the whole data set):
mgus2sub <- mgus2[sample(1:nrow(mgus2), size=500),]
# Apply 'interactionfor':
modelsurv <- interactionfor(formula = Surv(futime, death) ~ ., data=mgus2sub, num.trees=20)
# NOTE: num.trees = 20 (in the above) would be much too small for practical
# purposes. This small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 20000 if EIM values are calculated
# and num.trees = 2000 otherwise.
## Inspect the rankings of the variables and variable pairs with respect to
## the univariable, quantitative, and qualitative EIM values:
# Univariable EIM values:
modelcat$eim.univ.sorted
# Pairs with top quantitative EIM values:
modelcat$eim.quant.sorted[1:5]
# Pairs with top qualitative EIM values:
modelcat$eim.qual.sorted[1:5]
## Investigate visually the forms of the interaction effects of the variable pairs with
## largest quantitative and qualitative EIM values:
plot(modelcat)
plotEffects(modelcat, type="quant") # type="quant" is default.
plotEffects(modelcat, type="qual")
## Prediction:
# Separate 'zoo' data set randomly in training
# and test data:
data(zoo)
train.idx <- sample(nrow(zoo), 2/3 * nrow(zoo))
zoo.train <- zoo[train.idx, ]
zoo.test <- zoo[-train.idx, ]
# Construct interaction forest on training data:
# NOTE again: num.trees = 20 is specified too small for practical purposes.
modelcattrain <- interactionfor(dependent.variable.name = "type", data = zoo.train,
importance = "none", num.trees = 20)
# NOTE: Because we are only interested in prediction here, we do not
# calculate EIM values (by setting importance = "none"), because this
# speeds up calculations.
# Predict class values of the test data:
pred.zoo <- predict(modelcattrain, data = zoo.test)
# Compare predicted and true class values of the test data:
table(zoo.test$type, pred.zoo$predictions)
## End(Not run)