vimp_anova {vimp} | R Documentation |
Nonparametric Intrinsic Variable Importance Estimates: ANOVA
Description
Compute estimates of and confidence intervals for nonparametric ANOVA-based
intrinsic variable importance. This is a wrapper function for cv_vim, with
type = "anova". This type has limited functionality compared to other types;
in particular, null hypothesis tests are not possible using type = "anova".
If you want to do null hypothesis testing on an equivalent population
parameter, use vimp_rsquared instead.
Usage
vimp_anova(
  Y = NULL,
  X = NULL,
  cross_fitted_f1 = NULL,
  cross_fitted_f2 = NULL,
  indx = 1,
  V = 10,
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  na.rm = FALSE,
  cross_fitting_folds = NULL,
  stratified = FALSE,
  C = rep(1, length(Y)),
  Z = NULL,
  ipc_weights = rep(1, length(Y)),
  scale = "logit",
  ipc_est_type = "aipw",
  scale_est = TRUE,
  cross_fitted_se = TRUE,
  ...
)
Arguments
Y |
the outcome. |
X |
the covariates. If |
cross_fitted_f1 |
the predicted values on validation data from a
flexible estimation technique regressing Y on X in the training data. Provided as
either (a) a vector, where each element is
the predicted value when that observation is part of the validation fold;
or (b) a list of length V, where each element in the list is a set of predictions on the
corresponding validation data fold.
If sample-splitting is requested, then these must be estimated specially; see Details. However,
the resulting vector should be the same length as |
cross_fitted_f2 |
the predicted values on validation data from a
flexible estimation technique regressing either (a) the fitted values in
|
indx |
the indices of the covariate(s) to calculate variable importance for; defaults to 1. |
V |
the number of folds for cross-fitting, defaults to 10. If
|
run_regression |
if outcome Y and covariates X are passed to
|
SL.library |
a character vector of learners to pass to
|
alpha |
the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval. |
delta |
the value of the |
na.rm |
should we remove NAs in the outcome and fitted values
in computation? (defaults to |
cross_fitting_folds |
the folds for cross-fitting. Only used if
|
stratified |
if run_regression = TRUE, then should the generated folds be stratified based on the outcome? (This helps to ensure class balance across cross-validation folds.) |
C |
the indicator of coarsening (1 denotes observed, 0 denotes unobserved). |
Z |
either (i) NULL (the default, in which case the argument
|
ipc_weights |
weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]). |
scale |
should CIs be computed on original ("identity") or another scale? (options are "log" and "logit") |
ipc_est_type |
the type of procedure used for coarsened-at-random
settings; options are "ipw" (for inverse probability weighting) or
"aipw" (for augmented inverse probability weighting).
Only used if |
scale_est |
should the point estimate be scaled to be greater than or equal to 0?
Defaults to |
cross_fitted_se |
should we use cross-fitting to estimate the standard
errors ( |
... |
other arguments to the estimation tool, see "See also". |
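The alpha and scale arguments together determine how the confidence interval is formed. The following is a minimal sketch of a generic Wald-type interval on the identity and logit scales, using the delta method. The values of est and se are hypothetical, and the computation illustrates the general idea only; it may not match the package's internal computation exactly.

```r
# Hypothetical inputs: a point estimate in (0, 1) and its standard error
est <- 0.30
se <- 0.05
alpha <- 0.05

# normal quantile for a (1 - alpha) interval; alpha = 0.05 gives a 95% interval
z <- qnorm(1 - alpha / 2)

# identity scale: est +/- z * se (can fall outside [0, 1] for extreme estimates)
ci_identity <- est + c(-1, 1) * z * se

# logit scale: the delta method gives se / (est * (1 - est)) as the standard
# error of logit(est); build the interval there, then back-transform
se_logit <- se / (est * (1 - est))
ci_logit <- plogis(qlogis(est) + c(-1, 1) * z * se_logit)

ci_identity
ci_logit
```

An interval built on the logit scale always stays inside (0, 1) after back-transformation, which is one reason to prefer it for a parameter that is a proportion of variance.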
Details
We define the population ANOVA parameter for the group of features (or single
feature) s by
\psi_{0,s} := E_0\{f_0(X) - f_{0,s}(X)\}^2/var_0(Y),
where f_0 is the population conditional mean using all features, f_{0,s} is
the population conditional mean using the features with index not in s, and
E_0 and var_0 denote expectation and variance under the true data-generating
distribution, respectively.
Cross-fitted ANOVA estimates are computed by first splitting the data into K
folds; then using each fold in turn as a hold-out set, constructing estimators
f_{n,k} and f_{n,k,s} of f_0 and f_{0,s}, respectively, on the training data
and estimator E_{n,k} of E_0 using the test data; and finally, computing
\psi_{n,s} := K^{-1}\sum_{k=1}^K E_{n,k}\{f_{n,k}(X) - f_{n,k,s}(X)\}^2/var_n(Y),
where var_n is the empirical variance.
See the paper by Williamson, Gilbert, Simon, and Carone for more
details on the mathematics behind this function.
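The cross-fitting recipe described in Details can be sketched in base R. This is an illustration of the plug-in formula only, under stated assumptions: lm() stands in for a flexible learner, the reduced regression is Y on the features not in s, and all object names (folds, f_full, f_reduced, psi_hat) are illustrative. It is not the package's implementation, which may differ (for example, in how the estimate is corrected or scaled).

```r
# Simulate data in the same way as the Examples section
set.seed(4747)
n <- 200
K <- 2
x <- data.frame(X1 = runif(n, -5, 5), X2 = runif(n, -5, 5))
y <- (x$X1 / 5)^2 * (x$X1 + 7) / 5 + (x$X2 / 3)^2 + rnorm(n)
s <- 2                       # index of the feature of interest
dat <- cbind(y = y, x)       # columns: y, X1, X2

# assign each observation to one of K folds of equal size
folds <- sample(rep(seq_len(K), length.out = n))

f_full <- numeric(n)         # held-out predictions using all features
f_reduced <- numeric(n)      # held-out predictions omitting feature s
fold_means <- numeric(K)
for (k in seq_len(K)) {
  train <- folds != k
  test <- folds == k
  # f_{n,k} and f_{n,k,s}: fit on the training folds only
  fit_full <- lm(y ~ ., data = dat[train, ])
  fit_reduced <- lm(y ~ ., data = dat[train, -(s + 1), drop = FALSE])
  f_full[test] <- predict(fit_full, newdata = dat[test, ])
  f_reduced[test] <- predict(fit_reduced, newdata = dat[test, ])
  # E_{n,k}: empirical mean over the held-out fold
  fold_means[k] <- mean((f_full[test] - f_reduced[test])^2)
}

# average over folds and divide by the empirical variance of Y
var_n <- mean((y - mean(y))^2)
psi_hat <- mean(fold_means) / var_n
psi_hat
```

Here f_full and f_reduced are vectors in which each element is the prediction made when that observation was in the held-out fold; split(f_full, folds) would give the same predictions as a list with one element per fold.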
Value
An object of classes vim and vim_anova. See Details for more information.
See Also
SuperLearner for specific usage of the SuperLearner function and package.
Examples
# generate the data
# generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))
# apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2
# generate Y ~ Normal (smooth, 1)
y <- smooth + stats::rnorm(n, 0, 1)
# load the packages
library("vimp")
library("SuperLearner")
# set up a library for SuperLearner; note simple library for speed
learners <- c("SL.glm", "SL.mean")
# estimate (with a small number of folds, for illustration only)
est <- vimp_anova(y, x, indx = 2,
                  alpha = 0.05, run_regression = TRUE,
                  SL.library = learners, V = 2, cvControl = list(V = 2))