plotEffects {diversityForest} | R Documentation |
Interaction forest plots: exploring interaction forest results through visualisation
Description
This function allows to visualise the (estimated) bivariable influences of pairs of variables (with large quantitative and qualitative EIM values) on the outcome. This step is crucial, because to interpret interaction effects between variable pairs with large quantitative and qualitative EIM values, it is necessary to learn about the forms these interaction effects take.
Usage
plotEffects(
intobj,
type = "quant",
numpairs = 5,
indpairs = NULL,
pairs = NULL,
allwith = NULL,
pvalues = TRUE,
twoplots = TRUE,
addtitles = TRUE,
plotit = TRUE
)
Arguments
intobj |
Object of class |
type |
This can be either "quant" or "qual" and determines whether the plotted pairs are sorted according to either the quantitative or qualitative EIM values in decreasing order. Default is "quant". |
numpairs |
The number of pairs to plot (default: 5). This is overwritten by |
indpairs |
Optional. The indices of the pairs in the sorted lists of quantitative ( |
pairs |
This can be used to specify the pairs to plot. It is an optional list of character string vectors, where each of these vectors has length two. Each list element corresponds to one
pair, where the first character string gives the name of the first member of the respective pair to plot and the second character string gives the name of the second member.
This argument overwrites |
allwith |
This is an optional character string that can be set to the name of one of the variables. If provided, only variable pairs will
be considered that feature the variable specified by this argument |
pvalues |
Set to |
twoplots |
Set to |
addtitles |
Set to |
plotit |
This states whether the plots are actually plotted or merely returned as |
Details
For each considered pair the bivariable influence of both pair members on the outcome estimated
using a two-dimensional flexible function is shown. Such visualisations make it possible to learn
about the forms of the interaction effects between variable pairs with large EIM values. Moreover,
these visualisations reveal (pathological) cases in which variable pairs do not show
indications of interaction effects despite featuring large EIM values.
For binary outcomes the probabilities for the second class are estimated, for categorical outcomes with more than two classes
the probabilities for the largest class (i.e., the class with the most observations) are estimated (using the function plotPair
, a different
class can be selected instead), for metric outcomes the means of the outcome are estimated, and for survival outcomes
the log hazards ratio values compared to the median effect are estimated.
The kinds of estimates shown differ also according to whether both pair members are metric or only one of the two
members is metric and the other one categorical or both pair members are categorical:
If both pair members are metric and the outcome is categorical or metric we use two-dimensional LOESS regression, where in the case of categorical outcomes, to obtain probability estimates for the first class (or largest class for multi-class outcomes), we use the value '1' for the first class (largest class for multi-class outcomes) and the value '0' for the second class (all other classes for multi-class outcomes).
If both pair members are metric and the outcome is survival we use a Cox proportional hazard additive model with a two-dimensional LOESS smooth (
gamcox
function from the 'MapGAM' package (version 1.2-5)) and in the rare cases for which the latter fails, we use classical Cox regression with an interaction term between the two covariates.If one pair member is metric and the other one categorical and the outcome is categorical or metric, we use LOESS regression between the outcome (coded as '0' and '1' in the case of categorical outcomes as described above) and the values of the metric variable separately for each category of the categorical variable. In the rare cases in which the LOESS regression fails we use classical linear regression.
If one pair member is metric and the other one categorical and the outcome is survival, we use Cox regression with a linear tail-restricted cubic spline with four knots (univariable LOESS regression for survival outcomes does not seem to be available yet in R) separately for each category of the categorical variable. In cases in which the fitting of this spline regression fails we use classical Cox regression.
If both pair members are categorical and the outcome is categorical or metric, we simply calculate the mean of the outcome (coded as '0' and '1' in the case of categorical outcomes as described above) for each possible combination of the categories of the two variables.
If both pair members are categorical and the outcome is survival, we use classical Cox regression with an interaction term between the two variables (there is no need for any flexible modelling in this setting, because the Cox model with two categorical variables plus interaction term is saturated).
As described above (function argument: pvalues
), there is an option to add p-values from tests for interaction effect
to the plots. If at least one of the variables in the considered variable pair is categorical and features more
than two categories, there are more than one interaction terms in the regression approaches used for testing,
because the categorical variables are dummy-coded. Therefore, in these cases we obtain a p-value for each interaction term.
to obtain a single p-value for the test for interaction we adjust these multiple p-values using the Holm-Bonferroni
procedure and take the minimum of the adjusted p-values.
NOTE: These p-values are generally much too optimistic, in particular for small data sets and
large numbers of variables. The reason for this overoptimism is that these p-values are not adjusted
for the fact that we already used the data to find the variable pairs with strongest indications of
interaction effects. This is similar to a multiple testing problem.
Therefore, these p-values should only be seen as a rough guide to be interpreted very cautiously
and MUST NOT be reported as the results of a statistical test for significance
of interaction. To obtain adjusted p-values that would correspond to
valid tests, it would be possible to multiply these p-values by the number of possible pairs,
which would correspond to Bonferroni-adjusted p-values. For example, assume that we have 30
covariate variables. In that case the number of possible pairs would be 'choose(30, 2) = 435',
which is why we would need to multiply each p-value by 435
to obtain an adjusted p-value (or keep the original p-values and divide the significance
level 0.05 by 435). Note, however, that Bonferroni-adjusted p-values deliver quite conservative results,
that is, weaker effects might not be detected using these p-values, while, however, effects for
which these p-values are small (< 0.05
) are most likely relevant.
Note further that these Bonferroni-adjusted p-values should be interpreted
with caution because, stemming from bivariable models, these p-values do not take the
multivariable nature of the data into account.
Value
A list of ggplot2 plots returned invisibly.
Author(s)
Roman Hornung
References
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
See Also
Examples
## Not run:
## Load package:
library("diversityForest")
## Set seed to make results reproducible:
set.seed(1234)
## Construct interaction forest and calculate EIM values:
data(stock)
model <- interactionfor(dependent.variable.name = "company10", data = stock,
num.trees = 20)
# NOTE: num.trees = 20 (in the above) would be much too small for practical
# purposes. This small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 20000 if EIM values are calculated
# and num.trees = 2000 otherwise.
## Obtain a first overview by applying the plot() function for
## interactionfor obects:
plot(model)
## Several possible application cases of the plotEffects() function:
# Visualise the estimated bivariable influences of the five variable pairs with the
# largest quantitative EIM values:
plotEffects(model) # type="quant" is default.
# Visualise the estimated bivariable influences of the five pairs with the
# largest qualitative EIM values:
plotEffects(model, type="qual")
# Visualise the estimated bivariable influences of all (eight) pairs that involve
# the variable "company7" sorted in decreasing order according to the
# qualitative EIM values:
plotEffects(model, allwith="company7", type="qual", numpairs=8)
# Visualise the estimated bivariable influences of the pairs with third and fifth
# largest qualitative EIM values:
plotEffects(model, type="qual", indpairs=c(3,5))
# Visualise the estimated bivariable influences of the pairs ("company3", "company5") and
# ("company1", "company9"):
plotEffects(model, pairs=list(c("company3", "company5"), c("company1", "company9")))
## Saving of plots generated with the plotEffects() function (e.g., for use in publications):
# Apply plotEffects() to obtain plots for the five variable pairs
# with the largest qualitative EIM values and store these plots in
# an object 'ps':
ps <- plotEffects(model, type="qual", pvalues=FALSE, twoplots=FALSE, addtitles=FALSE, plotit=FALSE)
# pvalues = FALSE states that no p-values should be shown in the plots,
# because these might not be desired in plots meant for publication.
# twoplots = FALSE ensures that we get one plot for each page instead of two plots per page.
# addtitles = FALSE removes the automatically generated titles, because these are likely
# not desired in publications.
# plotit = FALSE ensures that the plots are not displayed, but only returned (invisibly)
# by plotEffects().
# Save the plot with second largest qualitative EIM value:
p1 <- ps[[2]]
# Add title:
library("ggpubr")
p1 <- annotate_figure(p1, top = text_grob("My descriptive plot title 1", face = "bold", size = 14))
p1
# Save as PDF:
# library("ggplot2")
# ggsave(file="mypathtofolder/FigureXY1.pdf", width=14, height=6)
# Save the plot with fifth largest qualitative EIM value:
p2 <- ps[[5]]
# Add title:
p2 <- annotate_figure(p2, top = text_grob("My descriptive plot title 2", face = "bold", size = 14))
p2
# Save as PDF:
# ggsave(file="mypathtofolder/FigureXY1.pdf", width=14, height=6)
# Combine both of the above plots:
p <- ggarrange(p1, p2, nrow = 2)
p
# Save the combined plot:
# ggsave(file="mypathtofolder/FigureXYcombined.pdf", width=14, height=11)
# NOTE: Using plotEffects() it is not possible to change the plots
# themselves (by e.g., increasing the label sizes or changing the
# axes ranges). However, the function plotPair() can be used to change
# the plots themselves.
## End(Not run)