hstats {hstats} | R Documentation |
Calculate Interaction Statistics
Description
This is the main function of the package. It does the expensive calculations behind the following H-statistics:
- Total interaction strength H^2, a statistic measuring the proportion of prediction variability unexplained by main effects of v, see h2() for details.
- Friedman and Popescu's statistic H^2_j of overall interaction strength per feature, see h2_overall() for details.
- Friedman and Popescu's statistic H^2_{jk} of pairwise interaction strength, see h2_pairwise() for details.
- Friedman and Popescu's statistic H^2_{jkl} of three-way interaction strength, see h2_threeway() for details. To save time, this statistic is not calculated by default. Set threeway_m to a value above 2 to get three-way statistics of the threeway_m variables with strongest overall interaction.
Furthermore, it can calculate an experimental partial-dependence-based measure of feature importance, \textrm{PDI}_j^2. It equals the proportion of prediction variability unexplained by other features, see pd_importance() for details. This statistic is not shown by summary() or plot().
Instead of using summary(), interaction statistics can also be obtained via the more flexible functions h2(), h2_overall(), h2_pairwise(), and h2_threeway().
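To make the pairwise statistic concrete, the following base R sketch computes H^2_{jk} by brute force from Friedman and Popescu's definition: the mean squared difference between the centered two-dimensional partial dependence F_{jk} and the sum of the centered one-dimensional functions F_j and F_k, divided by the mean of F_{jk}^2. The helper names pd_at_rows() and h2_pair() and the simulated data are assumptions made for illustration only. This is not the package's implementation, which is considerably more efficient.

```r
# Illustrative only: brute-force pairwise H^2 on simulated data.
set.seed(1)
n <- 100
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- with(X, x1 + x2 + x3 + x2 * x3) + rnorm(n, sd = 0.1)
fit <- lm(y ~ x1 + x2 * x3, data = cbind(X, y = y))

# Hypothetical helper: centered partial dependence on the features in `v`,
# evaluated at every row of X. For row i, the columns in `v` are fixed at
# row i's values for all observations and the predictions are averaged.
pd_at_rows <- function(fit, X, v) {
  pd <- vapply(seq_len(nrow(X)), function(i) {
    Xi <- X
    for (z in v) Xi[[z]] <- X[[z]][i]
    mean(predict(fit, newdata = Xi))
  }, numeric(1))
  pd - mean(pd)
}

# Hypothetical helper: pairwise H^2 for features j and k.
h2_pair <- function(fit, X, j, k) {
  F_jk <- pd_at_rows(fit, X, c(j, k))
  F_j <- pd_at_rows(fit, X, j)
  F_k <- pd_at_rows(fit, X, k)
  mean((F_jk - F_j - F_k)^2) / mean(F_jk^2)
}

h2_x1x2 <- h2_pair(fit, X, "x1", "x2")  # model has no x1:x2 term
h2_x2x3 <- h2_pair(fit, X, "x2", "x3")  # model contains x2:x3
```

In this sketch, h2_x1x2 is zero up to floating point error because the fitted model is additive in x1 and x2, while h2_x2x3 is clearly positive. The cost is of order n^2 predictions per statistic, which is why the function documented here subsamples the background data via n_max.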
Usage
hstats(object, ...)

## Default S3 method:
hstats(
  object,
  X,
  v = NULL,
  pred_fun = stats::predict,
  pairwise_m = 5L,
  threeway_m = 0L,
  approx = FALSE,
  grid_size = 50L,
  n_max = 500L,
  eps = 1e-10,
  w = NULL,
  verbose = TRUE,
  ...
)

## S3 method for class 'ranger'
hstats(
  object,
  X,
  v = NULL,
  pred_fun = function(m, X, ...) stats::predict(m, X, ...)$predictions,
  pairwise_m = 5L,
  threeway_m = 0L,
  approx = FALSE,
  grid_size = 50L,
  n_max = 500L,
  eps = 1e-10,
  w = NULL,
  verbose = TRUE,
  ...
)

## S3 method for class 'explainer'
hstats(
  object,
  X = object[["data"]],
  v = NULL,
  pred_fun = object[["predict_function"]],
  pairwise_m = 5L,
  threeway_m = 0L,
  approx = FALSE,
  grid_size = 50L,
  n_max = 500L,
  eps = 1e-10,
  w = object[["weights"]],
  verbose = TRUE,
  ...
)
Arguments
object |
Fitted model object. |
... |
Additional arguments passed to pred_fun(object, X, ...), for instance type = "response" in a glm() model. |
X |
A data.frame or matrix serving as background dataset. |
v |
Vector of feature names. The default is NULL. |
pred_fun |
Prediction function of the form function(object, X, ...). The default is stats::predict. |
pairwise_m |
Number of features for which pairwise statistics are to be
calculated. The features are selected based on Friedman and Popescu's overall
interaction strength H^2_j. The default is 5. |
threeway_m |
Like pairwise_m, but for three-way statistics. The default is 0. |
approx |
Should quantile approximation be applied to dense numeric features?
The default is FALSE. |
grid_size |
Integer controlling the number of quantile midpoints used to
approximate dense numerics. The quantile midpoints are calculated after
subsampling via n_max. The default is 50. |
n_max |
If X has more than n_max rows, it is sampled to n_max rows. The default is 500. |
eps |
Threshold below which numerator values are set to 0. Default is 1e-10. |
w |
Optional vector of case weights. Can also be a column name of X. |
verbose |
Should a progress bar be shown? The default is TRUE. |
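The arguments approx and grid_size refer to replacing dense numeric features by quantile midpoints. The base R sketch below shows one plausible reading of that idea: cut the feature at grid_size + 1 quantiles and map each value to the midpoint of its quantile interval. The helper name approx_by_midpoints() is hypothetical, and the package's internal code may differ in details such as tie handling.

```r
# Hypothetical helper: replace a dense numeric vector by at most `grid_size`
# quantile midpoints. A sketch of the idea, not the package's code.
approx_by_midpoints <- function(x, grid_size = 50L) {
  if (!is.numeric(x) || length(unique(x)) <= grid_size) {
    return(x)  # non-numeric or already coarse: leave unchanged
  }
  p <- seq(0, 1, length.out = grid_size + 1L)
  q <- unique(quantile(x, probs = p, names = FALSE))  # interval borders
  mids <- (q[-1L] + q[-length(q)]) / 2                # interval midpoints
  # Map every value to the midpoint of the interval it falls into
  mids[findInterval(x, q, rightmost.closed = TRUE)]
}

set.seed(1)
x <- rnorm(1000)
x_approx <- approx_by_midpoints(x, grid_size = 20L)
length(unique(x_approx))  # at most 20 distinct values remain
```

Fewer distinct values mean fewer distinct grid points at which partial dependence functions have to be evaluated, which is where the speed-up of such an approximation would come from.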
Value
An object of class "hstats" containing these elements:
- X: Input X (sampled to n_max rows, after optional quantile approximation).
- w: Case weight vector w (sampled to n_max values), or NULL.
- v: Vector of column names in X for which overall H statistics have been calculated.
- f: Matrix with (centered) predictions F.
- mean_f2: (Weighted) column means of f^2. Used to normalize H^2 and H^2_j.
- F_j: List of matrices, each representing (centered) partial dependence functions F_j.
- F_not_j: List of matrices with (centered) partial dependence functions F_{\setminus j} of other features.
- K: Number of columns of prediction matrix.
- pred_names: Column names of prediction matrix.
- pairwise_m: Like input pairwise_m, but capped at length(v).
- threeway_m: Like input threeway_m, but capped at the smaller of length(v) and pairwise_m.
- eps: Like input eps.
- pd_importance: List with numerator and denominator of \textrm{PDI}_j.
- h2: List with numerator and denominator of H^2.
- h2_overall: List with numerator and denominator of H^2_j.
- v_pairwise: Subset of v with largest H^2_j used for pairwise calculations. Only if pairwise calculations have been done.
- combs2: Named list of variable pairs for which pairwise partial dependence functions are available. Only if pairwise calculations have been done.
- F_jk: List of matrices, each representing (centered) bivariate partial dependence functions F_{jk}. Only if pairwise calculations have been done.
- h2_pairwise: List with numerator and denominator of H^2_{jk}. Only if pairwise calculations have been done.
- v_threeway: Subset of v with largest h2_overall() used for three-way calculations. Only if three-way calculations have been done.
- combs3: Named list of variable triples for which three-way partial dependence functions are available. Only if three-way calculations have been done.
- F_jkl: List of matrices, each representing (centered) three-way partial dependence functions F_{jkl}. Only if three-way calculations have been done.
- h2_threeway: List with numerator and denominator of H^2_{jkl}. Only if three-way calculations have been done.
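Several elements above store a statistic as a numerator and a denominator rather than as a finished value. The sketch below illustrates, under assumptions, how such a pair could be turned into the final statistic: numerators below eps are set to 0, the ratio is taken only when normalizing, and the square root is taken when the unsquared scale is requested. The function stat_from_parts() and the numbers are made up for illustration and are not part of the package.

```r
# Hypothetical helper: combine a numerator/denominator pair into a statistic.
stat_from_parts <- function(num, denom, eps = 1e-10,
                            normalize = TRUE, squared = TRUE) {
  num[num < eps] <- 0                      # treat tiny numerators as exact zeros
  out <- if (normalize) num / denom else num
  if (!squared) out <- sqrt(out)           # back to the scale of the predictions
  out
}

# Made-up numerators and denominators for two feature pairs
num <- c("x1:x2" = 3e-15, "x2:x3" = 0.25)
denom <- c("x1:x2" = 2, "x2:x3" = 1)

rel <- stat_from_parts(num, denom)                                     # 0 and 0.25
abs_stat <- stat_from_parts(num, denom, normalize = FALSE, squared = FALSE)  # 0 and 0.5
```

Storing both parts keeps the expensive calculation separate from cheap presentation choices, which matches how the Examples request unnormalized, unsquared values via normalize = FALSE and squared = FALSE.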
Methods (by class)
- hstats(default): Default hstats method.
- hstats(ranger): Method for "ranger" models.
- hstats(explainer): Method for DALEX "explainer".
References
Friedman, Jerome H., and Bogdan E. Popescu. "Predictive Learning via Rule Ensembles." The Annals of Applied Statistics 2, no. 3 (2008): 916-54.
See Also
h2(), h2_overall(), h2_pairwise(), h2_threeway(), and pd_importance() for specific statistics calculated from the resulting object.
Examples
# MODEL 1: Linear regression
fit <- lm(Sepal.Length ~ . + Petal.Width:Species, data = iris)
s <- hstats(fit, X = iris[, -1])
s
plot(s)
plot(s, zero = FALSE) # Drop 0
summary(s)
# Absolute pairwise interaction strengths
h2_pairwise(s, normalize = FALSE, squared = FALSE, zero = FALSE)
# MODEL 2: Multi-response linear regression
fit <- lm(as.matrix(iris[, 1:2]) ~ Petal.Length + Petal.Width * Species, data = iris)
s <- hstats(fit, X = iris[, 3:5], verbose = FALSE)
plot(s)
summary(s)
# MODEL 3: Gamma GLM with log link
fit <- glm(Sepal.Length ~ ., data = iris, family = Gamma(link = log))
# No interactions for additive features, at least on link scale
s <- hstats(fit, X = iris[, -1], verbose = FALSE)
summary(s)
# On original scale, we have interactions everywhere.
# To see three-way interactions, we set threeway_m to a value above 2.
s <- hstats(fit, X = iris[, -1], type = "response", threeway_m = 5)
plot(s, ncol = 1) # All three types use different denominators
# All statistics on same scale (of predictions)
plot(s, squared = FALSE, normalize = FALSE, facet_scale = "free_y")