interactionfor {diversityForest}    R Documentation
Construct an interaction forest prediction rule and calculate EIM values as described in Hornung & Boulesteix (2022).
Description
Implements interaction forests as described in Hornung & Boulesteix (2022). Currently, categorical, metric, and survival outcomes are supported. Interaction forests feature the effect importance measure (EIM), which can be used to rank the covariate variable pairs with respect to the impact of their interaction effects on prediction. This allows the identification of relevant interaction effects. Interaction forests focus on easily interpretable interaction effects. See the 'Details' section below for more information. In addition, we strongly recommend consulting Section C of Supplementary Material 1 of Hornung & Boulesteix (2022), which uses detailed examples of interaction forest analyses, including code, to illustrate how interaction forests can be used in applications.
Usage
interactionfor(
formula = NULL,
data = NULL,
importance = "both",
num.trees = NULL,
simplify.large.n = TRUE,
num.trees.eim.large.n = NULL,
write.forest = TRUE,
probability = FALSE,
min.node.size = NULL,
max.depth = NULL,
replace = FALSE,
sample.fraction = ifelse(replace, 1, 0.7),
case.weights = NULL,
class.weights = NULL,
splitrule = NULL,
always.split.variables = NULL,
keep.inbag = FALSE,
inbag = NULL,
holdout = FALSE,
quantreg = FALSE,
oob.error = TRUE,
num.threads = NULL,
verbose = TRUE,
seed = NULL,
dependent.variable.name = NULL,
status.variable.name = NULL,
npairs = NULL,
classification = NULL
)
Arguments
formula
Object of class formula or character describing the model to fit.
data
Training data of class data.frame or matrix.
importance
Effect importance mode. One of the following: "both" (the default), "qualitative", "quantitative", "mainonly", "none". See the 'Details' section below for explanation.
num.trees
Number of trees. The default number is 20000 if EIM values should be computed and 2000 otherwise. Note that if simplify.large.n = TRUE (the default) and the data set contains more than 1000 observations, the number of trees used for EIM value calculation is governed by num.trees.eim.large.n (see the 'Details' section below).
simplify.large.n
Should restricted tree depths be used when calculating EIM values for large data sets? See the 'Details' section below for more information. Default is TRUE.
num.trees.eim.large.n
Number of trees in the forest used for calculating the EIM values for large data sets.
This forest is only constructed if simplify.large.n = TRUE and the data set contains more than 1000 observations; see the 'Details' section below for more information.
write.forest
Save the forest in the output object, required for prediction. Set to FALSE to reduce memory usage if no prediction is intended.
probability
Grow a probability forest as in Malley et al. (2012).
min.node.size
Minimal node size. Default 1 for classification, 5 for regression, 3 for survival, and 5 for probability.
max.depth
Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree).
replace
Sample with replacement. Default is FALSE.
sample.fraction
Fraction of observations to sample. Default is 1 for sampling with replacement and 0.7 for sampling without replacement. For classification, this can be a vector of class-specific values.
case.weights
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
class.weights
Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes.
splitrule
Splitting rule. For classification and probability estimation "gini" or "extratrees" with default "gini". For regression "variance", "extratrees" or "maxstat" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank". NOTE: For interaction forests currently only the default splitting rules are supported.
always.split.variables
Currently not usable. Character vector with variable names to be always selected.
keep.inbag
Save how often observations are in-bag in each tree.
inbag
Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling.
holdout
Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error. NOTE: Currently not usable for interaction forests.
quantreg
Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set keep.inbag = TRUE to prepare out-of-bag quantile prediction.
oob.error
Compute OOB prediction error. Set to FALSE to save computation time, e.g., for large survival forests.
num.threads
Number of threads. Default is number of CPUs available.
verbose
Show computation status and estimated runtime.
seed
Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed.
dependent.variable.name
Name of outcome variable, needed if no formula given. For survival forests this is the time variable.
status.variable.name
Name of status variable, only applicable to survival data and needed if no formula given. Use 1 for event and 0 for censoring.
npairs
Number of variable pairs to sample for each split. Default is the square root of the number of independent variables divided by 2, rounded up (see the sketch after this argument list).
classification
Only needed if data is a matrix. Set to TRUE to grow a classification forest.
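The default value of npairs described above can be written as a one-line R expression. A minimal sketch (p stands for the number of independent variables; the exact placement of the rounding step is an assumption based on the wording above):
p <- 30                                 # number of independent variables (example value)
npairs.default <- ceiling(sqrt(p) / 2)  # assumed reading: square root of p, divided by 2, rounded up
npairs.default                          # 3 for p = 30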
Details
The effect importance measure (EIM) of interaction forests distinguishes quantitative and qualitative interaction effects (Peto, 1982).
This is a common distinction as these two types of interaction effects are interpreted in different ways (see below).
For both of these types, EIM values for each variable pair are obtained: the quantitative and qualitative EIM values.
Interaction forests target easily interpretable types of interaction effects. These can be communicated clearly using statements
of the following kind: "The strength of the positive (negative) effect of variable A on the outcome depends on the level of variable B"
for quantitative interactions, and "for observations with small values of variable B, the effect of variable A is positive (negative),
but for observations with large values of B, the effect of A is negative (positive)" for qualitative interactions.
In addition to calculating EIM values for variable pairs, importance values for the individual variables are calculated as well: the univariable
EIM values. These measure variable importance in the same way as classical variable importance measures of random forests.
The effect importance mode can be set via the importance argument:
"qualitative": calculate only qualitative EIM values;
"quantitative": calculate only quantitative EIM values;
"both" (the default): calculate qualitative and quantitative EIM values;
"mainonly": calculate only univariable EIM values.
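A minimal sketch of how the different effect importance modes might be requested in practice (the 'zoo' data set from the Examples section is used here, and num.trees is kept very small only to keep runtime short):
library("diversityForest")
data(zoo)
# Qualitative EIM values only:
mod.qual <- interactionfor(dependent.variable.name = "type", data = zoo,
                           importance = "qualitative", num.trees = 20)
# Univariable EIM values only:
mod.univ <- interactionfor(dependent.variable.name = "type", data = zoo,
                           importance = "mainonly", num.trees = 20)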
The top variable pairs with largest quantitative and qualitative EIM values likely have quantitative and qualitative interactions,
respectively, which have a considerable impact on prediction. The top variables with largest univariable EIM values likely have a considerable
impact on prediction. Note that it is currently not possible to test the EIM values for
statistical significance using the interaction forests algorithm itself. However, the p-values
shown in the plots obtained with plotEffects
(which are obtained using bivariable
models) can be adjusted for multiple testing using the Bonferroni procedure to obtain
practical p-values. See the end of the 'Details' section of plotEffects
for explanation and guidance.
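To illustrate the Bonferroni adjustment step itself, here is a minimal sketch in base R, where pvals is a hypothetical vector of p-values read off the plotEffects plots (the numbers below are made up for illustration):
# Hypothetical p-values taken from the plotEffects plots (made-up values):
pvals <- c(0.0004, 0.012, 0.031, 0.20, 0.47)
# Bonferroni adjustment for multiple testing:
p.adjust(pvals, method = "bonferroni")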
If the number of variables is larger than 100, not all possible variable pairs are considered, but, using a screening procedure, the
5000 variable pairs with the strongest indications of interaction effects are pre-selected.
NOTE: To make interpretations, it is crucial to investigate (visually) the forms that the interaction effects of variable pairs
with large quantitative and qualitative EIM values take. This can be done using the plot functions plot.interactionfor
(first overview) and plotEffects.
NOTE ALSO: As described in Hornung & Boulesteix (2022), in the case of data with larger numbers of variables (larger than 100,
but more seriously for high-dimensional data), the univariable EIM values can be biased. Therefore, it is strongly recommended
to interpret the univariable EIM values with caution if the data are high-dimensional. If it is of interest to measure the univariable
importance of the variables for high-dimensional data, an additional conventional random forest (e.g., using the ranger package)
should be constructed, and the variable importance measure values of this random forest should be used for ranking the univariable effects.
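A minimal sketch of how such an additional conventional random forest could be constructed with the ranger package; permutation importance is used here as one possible choice, and the small 'zoo' data set merely stands in for a real (high-dimensional) data set:
library("ranger")
data(zoo, package = "diversityForest")
# Conventional random forest with permutation variable importance:
rfmod <- ranger(dependent.variable.name = "type", data = zoo,
                importance = "permutation", num.trees = 500)
# Rank the variables by their univariable importance:
sort(importance(rfmod), decreasing = TRUE)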
For large data sets with many observations, the calculation of the EIM values can become very costly when using fully grown trees.
Therefore, the following maximum tree depths are used by default when calculating EIM values (argument: simplify.large.n = TRUE):
for data sets with at most 1000 observations, fully grown trees are used; for data sets with more than 1000 observations, the maximum
tree depth is restricted increasingly strongly with growing data set size, from depth 10 over depth 7 down to depth 5 for the largest data sets.
Extensive analyses in Hornung & Boulesteix (2022) suggest that by restricting the tree depth in this way,
the EIM values that would result when using fully grown trees are approximated well. However, prediction
performance suffers when using restricted trees. Therefore, the tree depth is restricted only when calculating
the EIM values (if simplify.large.n = TRUE), while a second interaction forest with unrestricted tree depth is constructed,
which is then used for prediction purposes.
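If this default behaviour is not desired, it can be influenced via the arguments described above. A minimal sketch, where largedat and the outcome name "y" are hypothetical placeholders for a data set with more than 1000 observations:
# 'largedat' and the outcome "y" are placeholders for a real, large data set.
# Keep the simplification, but use fewer trees in the forest that is
# grown for EIM value calculation:
mod1 <- interactionfor(dependent.variable.name = "y", data = largedat,
                       num.trees.eim.large.n = 5000)
# Switch the simplification off entirely and use fully grown trees
# (can be very slow for large data sets):
mod2 <- interactionfor(dependent.variable.name = "y", data = largedat,
                       simplify.large.n = FALSE)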
Value
Object of class interactionfor
with elements
predictions
Predicted classes/values, based on out-of-bag samples (classification and regression only).
num.trees
Number of trees.
num.independent.variables
Number of independent variables.
unique.death.times
Unique death times (survival only).
min.node.size
Value of minimal node size used.
npairs
Number of variable pairs sampled for each split.
eim.univ.sorted
Univariable EIM values sorted in decreasing order.
eim.univ
Univariable EIM values.
eim.qual.sorted
Qualitative EIM values sorted in decreasing order.
eim.qual
Qualitative EIM values.
eim.quant.sorted
Quantitative EIM values sorted in decreasing order.
eim.quant
Quantitative EIM values. These values are labeled analogously to those in eim.quant.sorted.
prediction.error
Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples, for probability estimation the Brier score, for regression the mean squared error, and for survival one minus Harrell's C-index. This is 'NA' for data sets with more than 100 covariate variables, because for such data sets we pre-select the 5000 variable pairs with strongest indications of interaction effects. This pre-selection cannot be taken into account in the out-of-bag error estimation, which is why the out-of-bag error estimates would be (much) too optimistic for data sets with more than 100 covariate variables.
forest
Saved forest (if write.forest set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column number in R.
confusion.matrix
Contingency table for classes and predictions based on out-of-bag samples (classification only).
chf
Estimated cumulative hazard function for each sample (survival only).
survival
Estimated survival function for each sample (survival only).
splitrule
Splitting rule.
treetype
Type of forest/tree: classification, regression, or survival.
r.squared
R squared. Also called explained variance or coefficient of determination (regression only). Computed on out-of-bag data.
call
Function call.
importance.mode
Importance mode used.
num.samples
Number of samples.
replace
Sample with replacement.
eim.quant.rawlists
List containing the four vectors of un-adjusted 'raw' quantitative EIM values
and the four vectors of adjusted EIM values. These are usually not required by the user.
promispairs
List giving the indices of the variables in the pre-selected variable pairs. If the number of variables is at most 100, all variable pairs are considered.
plotres
List of objects needed by the plot functions plot.interactionfor and plotEffects.
Author(s)
Roman Hornung, Marvin N. Wright
References
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Peto, R. (1982). Statistical aspects of cancer trials. In: K. E. Halnam (Ed.), Treatment of Cancer. Chapman & Hall: London.
Wright, M. N., Ziegler, A. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Breiman, L. (2001). Random forests. Machine Learning 45:5-32, <doi:10.1023/A:1010933404324>.
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., Ziegler, A. (2012). Probability machines: Consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, <doi:10.3414/ME00-01-0052>.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research 7:983-999.
See Also
predict.divfor, plot.interactionfor, plotEffects
Examples
## Not run:
## Load package:
library("diversityForest")
## Set seed to make results reproducible:
set.seed(1234)
## Construct interaction forests and calculate EIM values:
# Binary outcome:
data(zoo)
modelcat <- interactionfor(dependent.variable.name = "type", data = zoo,
num.trees = 20)
# Metric outcome:
data(stock)
modelcont <- interactionfor(dependent.variable.name = "company10", data = stock,
num.trees = 20)
# Survival outcome:
library("survival")
mgus2$id <- NULL # 'mgus2' data set is contained in the 'survival' package
# categorical variables need to be of factor format - important!!
mgus2$sex <- factor(mgus2$sex)
mgus2$pstat <- factor(mgus2$pstat)
# Remove the second time variable 'ptime':
mgus2$ptime <- NULL
# Remove missing values:
mgus2 <- mgus2[complete.cases(mgus2),]
# Take subset to make the calculations less computationally
# expensive for the example (in actual applications, we would of course
# use the whole data set):
mgus2sub <- mgus2[sample(1:nrow(mgus2), size=500),]
# Apply 'interactionfor':
modelsurv <- interactionfor(formula = Surv(futime, death) ~ ., data=mgus2sub, num.trees=20)
# NOTE: num.trees = 20 (in the above) would be much too small for practical
# purposes. This small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 20000 if EIM values are calculated
# and num.trees = 2000 otherwise.
## Inspect the rankings of the variables and variable pairs with respect to
## the univariable, quantitative, and qualitative EIM values:
# Univariable EIM values:
modelcat$eim.univ.sorted
# Pairs with top quantitative EIM values:
modelcat$eim.quant.sorted[1:5]
# Pairs with top qualitative EIM values:
modelcat$eim.qual.sorted[1:5]
## Investigate visually the forms of the interaction effects of the variable pairs with
## largest quantitative and qualitative EIM values:
plot(modelcat)
plotEffects(modelcat, type="quant") # type="quant" is default.
plotEffects(modelcat, type="qual")
## Prediction:
# Separate 'zoo' data set randomly in training
# and test data:
data(zoo)
train.idx <- sample(nrow(zoo), 2/3 * nrow(zoo))
zoo.train <- zoo[train.idx, ]
zoo.test <- zoo[-train.idx, ]
# Construct interaction forest on training data:
# NOTE again: num.trees = 20 is specified too small for practical purposes.
modelcattrain <- interactionfor(dependent.variable.name = "type", data = zoo.train,
importance = "none", num.trees = 20)
# NOTE: Because we are only interested in prediction here, we do not
# calculate EIM values (by setting importance = "none"), because this
# speeds up calculations.
# Predict class values of the test data:
pred.zoo <- predict(modelcattrain, data = zoo.test)
# Compare predicted and true class values of the test data:
table(zoo.test$type, pred.zoo$predictions)
## End(Not run)