vimp_rsquared {vimp} | R Documentation |
Nonparametric Intrinsic Variable Importance Estimates: R-squared
Description
Compute estimates of and confidence intervals for nonparametric $R^2$-based
intrinsic variable importance. This is a wrapper function for cv_vim,
with type = "r_squared".
Usage
vimp_rsquared(
  Y = NULL,
  X = NULL,
  cross_fitted_f1 = NULL,
  cross_fitted_f2 = NULL,
  f1 = NULL,
  f2 = NULL,
  indx = 1,
  V = 10,
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  na.rm = FALSE,
  final_point_estimate = "split",
  cross_fitting_folds = NULL,
  sample_splitting_folds = NULL,
  stratified = FALSE,
  C = rep(1, length(Y)),
  Z = NULL,
  ipc_weights = rep(1, length(Y)),
  scale = "logit",
  ipc_est_type = "aipw",
  scale_est = TRUE,
  cross_fitted_se = TRUE,
  ...
)
Arguments
Y |
the outcome. |
X |
the covariates. If |
cross_fitted_f1 |
the predicted values on validation data from a
flexible estimation technique regressing Y on X in the training data. Provided as
either (a) a vector, where each element is
the predicted value when that observation is part of the validation fold;
or (b) a list of length V, where each element in the list is a set of predictions on the
corresponding validation data fold.
If sample-splitting is requested, then these must be estimated specially; see Details. However,
the resulting vector should be the same length as |
cross_fitted_f2 |
the predicted values on validation data from a
flexible estimation technique regressing either (a) the fitted values in
|
f1 |
the fitted values from a flexible estimation technique
regressing Y on X. If sample-splitting is requested, then these must be
estimated specially; see Details. If |
f2 |
the fitted values from a flexible estimation technique
regressing either (a) |
indx |
the indices of the covariate(s) to calculate variable importance for; defaults to 1. |
V |
the number of folds for cross-fitting; defaults to 10. If
|
run_regression |
if outcome Y and covariates X are passed to
|
SL.library |
a character vector of learners to pass to
|
alpha |
the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval. |
delta |
the value of the |
na.rm |
should we remove NAs in the outcome and fitted values
in computation? (defaults to |
final_point_estimate |
if sample splitting is used, should the final point estimates
be based on only the sample-split folds used for inference ( |
cross_fitting_folds |
the folds for cross-fitting. Only used if
|
sample_splitting_folds |
the folds used for sample-splitting;
these identify the observations that should be used to evaluate
predictiveness based on the full and reduced sets of covariates, respectively.
Only used if |
stratified |
if run_regression = TRUE, should the generated folds be stratified based on the outcome? (Stratification helps to ensure class balance across cross-validation folds.) |
C |
the indicator of coarsening (1 denotes observed, 0 denotes unobserved). |
Z |
either (i) NULL (the default, in which case the argument
|
ipc_weights |
weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]). |
scale |
should CIs be computed on original ("identity") or another scale? (options are "log" and "logit") |
ipc_est_type |
the type of procedure used for coarsened-at-random
settings; options are "ipw" (for inverse probability weighting) or
"aipw" (for augmented inverse probability weighting).
Only used if |
scale_est |
should the point estimate be scaled to be greater than or equal to 0?
Defaults to |
cross_fitted_se |
should we use cross-fitting to estimate the standard
errors ( |
... |
other arguments to the estimation tool, see "See also". |
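The two accepted layouts for cross_fitted_f1 and cross_fitted_f2 (a vector, or a list of length V) carry the same information. The base-R sketch below illustrates how they relate; the names folds, preds, preds_vector and preds_list are hypothetical and are not arguments of vimp_rsquared, and the predictions are random placeholders rather than output from a real learner.

```r
# Hypothetical illustration of the two layouts; none of these names come from vimp.
set.seed(1)
n <- 12
V <- 3

# fold label (1, ..., V) for each observation
folds <- sample(rep(seq_len(V), length.out = n))

# placeholder out-of-fold predictions, one per observation
preds <- stats::rnorm(n)

# (a) vector layout: element i is the prediction for observation i,
#     made while the fold containing observation i was held out
preds_vector <- preds

# (b) list layout: element v holds the predictions for validation fold v
preds_list <- lapply(seq_len(V), function(v) preds_vector[folds == v])

# converting the list layout back to the vector layout
recovered <- numeric(n)
for (v in seq_len(V)) {
  recovered[folds == v] <- preds_list[[v]]
}
```

Either layout requires knowing which fold each observation belongs to, which is why the fold assignment is kept alongside the predictions in this sketch.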
Details
We define the population variable importance measure (VIM) for the
group of features (or single feature) $s$ with respect to the
predictiveness measure $V$ by
$$\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),$$
where $f_0$ is the population predictiveness maximizing function,
$f_{0,s}$ is the population predictiveness maximizing function that is
only allowed to access the features with index not in $s$, and $P_0$ is
the true data-generating distribution.
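As a rough illustration of this definition, the sketch below computes a simple plug-in difference in $R^2$ on simulated data. It assumes the usual form of the measure, $V(f, P) = 1 - E_P[(Y - f(X))^2] / \mathrm{Var}_P(Y)$, and uses stats::lm as a stand-in for the predictiveness-maximizing functions; both are assumptions made for illustration, and this is not the estimator implemented by vimp_rsquared. The helper r_squared and the variable names are hypothetical.

```r
# Plug-in sketch of psi_{0,s} for s = {x2}, using linear models as stand-ins
# for f_0 and f_{0,s}. Illustrative only; not the vimp implementation.
set.seed(20)
n <- 500
dat <- data.frame(x1 = stats::runif(n, -1, 1), x2 = stats::runif(n, -1, 1))
dat$y <- 2 * dat$x1 + dat$x2 + stats::rnorm(n)

# empirical R-squared: 1 - MSE / empirical variance of the outcome
r_squared <- function(y, pred) {
  1 - mean((y - pred)^2) / mean((y - mean(y))^2)
}

# full regression uses both features; reduced regression drops x2
full_fit <- stats::lm(y ~ x1 + x2, data = dat)
reduced_fit <- stats::lm(y ~ x1, data = dat)

# difference in predictiveness, evaluated on the same data used for fitting
plug_in_vim <- r_squared(dat$y, stats::fitted(full_fit)) -
  r_squared(dat$y, stats::fitted(reduced_fit))
```

Because the fits are evaluated on the data used to train them, this plug-in value is optimistic; cross-fitting, described next, addresses that.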
Cross-fitted VIM estimates are computed differently if sample-splitting
is requested versus if it is not. We recommend using sample-splitting
in most cases, since only in this case will inferences be valid if
the variable(s) of interest have truly zero population importance.
The purpose of cross-fitting is to estimate $f_0$ and $f_{0,s}$ on data
independent of the data used to estimate $P_0$; this can result in improved
performance, especially when using flexible learning algorithms. The purpose
of sample-splitting is to estimate $f_0$ and $f_{0,s}$ on data independent
of each other; this allows valid inference under the null hypothesis of
zero importance.
Without sample-splitting, cross-fitted VIM estimates are obtained by first
splitting the data into $K$ folds; then, using each fold in turn as a
hold-out set, constructing estimators $f_{n,k}$ and $f_{n,k,s}$ of
$f_0$ and $f_{0,s}$, respectively, on the training data and estimator
$P_{n,k}$ of $P_0$ using the test data; and finally, computing
$$\psi_{n,s} := K^{-1}\sum_{k=1}^K \{V(f_{n,k},P_{n,k}) - V(f_{n,k,s}, P_{n,k})\}.$$
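The procedure without sample-splitting can be sketched in base R as follows. This is a minimal illustration under stated assumptions: stats::lm stands in for a flexible learner, $V$ is taken to be the empirical $R^2$ on the held-out fold, and the names folds, fold_diffs and psi_ns are hypothetical. It computes only the point estimate and is not the cv_vim implementation.

```r
# Cross-fitted point estimate without sample-splitting (illustrative sketch).
set.seed(4747)
n <- 200
K <- 5
dat <- data.frame(x1 = stats::runif(n, -1, 1), x2 = stats::runif(n, -1, 1))
dat$y <- 2 * dat$x1 + dat$x2 + stats::rnorm(n)

# random assignment of observations to K folds
folds <- sample(rep(seq_len(K), length.out = n))

# empirical R-squared on a set of held-out observations
r_squared <- function(y, pred) {
  1 - mean((y - pred)^2) / mean((y - mean(y))^2)
}

# for each fold k: fit full and reduced regressions on the training data,
# then evaluate both on the held-out fold (which plays the role of P_{n,k})
fold_diffs <- vapply(seq_len(K), function(k) {
  train <- dat[folds != k, ]
  test <- dat[folds == k, ]
  f_nk <- stats::lm(y ~ x1 + x2, data = train)  # estimator of f_0
  f_nks <- stats::lm(y ~ x1, data = train)      # estimator of f_{0,s}, s = {x2}
  r_squared(test$y, stats::predict(f_nk, newdata = test)) -
    r_squared(test$y, stats::predict(f_nks, newdata = test))
}, numeric(1))

# average the K fold-specific differences
psi_ns <- mean(fold_diffs)
```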
With sample-splitting, cross-fitted VIM estimates are obtained by first
splitting the data into $2K$ folds. These folds are further divided
into 2 groups of folds. Then, for each fold $k$ in the first group,
estimator $f_{n,k}$ of $f_0$ is constructed using all data besides
the $k$th fold in the group (i.e., $(2K - 1)/(2K)$ of the data) and
estimator $P_{n,k}$ of $P_0$ is constructed using the held-out data
(i.e., $1/(2K)$ of the data); we then compute
$$v_{n,k} = V(f_{n,k},P_{n,k}).$$
Similarly, for each fold $k$ in the second group,
estimator $f_{n,k,s}$ of $f_{0,s}$ is constructed using all data
besides the $k$th fold in the group (i.e., $(2K - 1)/(2K)$ of the data)
and estimator $P_{n,k}$ of $P_0$ is constructed using the held-out
data (i.e., $1/(2K)$ of the data); we then compute
$$v_{n,k,s} = V(f_{n,k,s},P_{n,k}).$$
Finally,
$$\psi_{n,s} := K^{-1}\sum_{k=1}^K \{v_{n,k} - v_{n,k,s}\}.$$
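A literal reading of the sample-splitting steps can be sketched in base R as follows. Several choices here are assumptions made for illustration: stats::lm stands in for a flexible learner, $V$ is the empirical $R^2$ on the held-out fold, folds 1 to K form the first group and folds K+1 to 2K form the second, and the names group_full, group_reduced, v_full and v_reduced are hypothetical. The package's internal fold handling may differ.

```r
# Cross-fitted point estimate with sample-splitting (illustrative sketch).
set.seed(1234)
n <- 400
K <- 5
dat <- data.frame(x1 = stats::runif(n, -1, 1), x2 = stats::runif(n, -1, 1))
dat$y <- 2 * dat$x1 + dat$x2 + stats::rnorm(n)

# random assignment of observations to 2K folds, divided into two groups
folds <- sample(rep(seq_len(2 * K), length.out = n))
group_full <- seq_len(K)         # folds used to evaluate the full regression
group_reduced <- K + seq_len(K)  # folds used to evaluate the reduced regression

# empirical R-squared on a set of held-out observations
r_squared <- function(y, pred) {
  1 - mean((y - pred)^2) / mean((y - mean(y))^2)
}

# fit a model on all data besides fold k, then evaluate it on fold k
fold_predictiveness <- function(k, formula) {
  train <- dat[folds != k, ]
  test <- dat[folds == k, ]
  fit <- stats::lm(formula, data = train)
  r_squared(test$y, stats::predict(fit, newdata = test))
}

# v_{n,k}: full regression evaluated on folds in the first group
v_full <- vapply(group_full, fold_predictiveness, numeric(1),
                 formula = y ~ x1 + x2)
# v_{n,k,s}: reduced regression (x2 removed) evaluated on folds in the second group
v_reduced <- vapply(group_reduced, fold_predictiveness, numeric(1),
                    formula = y ~ x1)

# average of the paired differences
psi_ns <- mean(v_full - v_reduced)
```

The full and reduced predictiveness values are evaluated on disjoint sets of observations, which is the property the Details section ties to valid inference under the null hypothesis of zero importance.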
See the paper by Williamson, Gilbert, Simon, and Carone for more
details on the mathematics behind the cv_vim function, and the
validity of the confidence intervals.
In the interest of transparency, we return most of the calculations
within the vim object. This results in a list including:
- s
the column(s) to calculate variable importance for
- SL.library
the library of learners passed to SuperLearner
- full_fit
the fitted values of the chosen method fit to the full data (a list, for train and test data)
- red_fit
the fitted values of the chosen method fit to the reduced data (a list, for train and test data)
- est
the estimated variable importance
- naive
the naive estimator of variable importance
- eif
the estimated efficient influence function
- eif_full
the estimated efficient influence function for the full regression
- eif_reduced
the estimated efficient influence function for the reduced regression
- se
the standard error for the estimated variable importance
- ci
the $(1-\alpha) \times 100$% confidence interval for the variable importance estimate
- test
a decision to either reject (TRUE) or not reject (FALSE) the null hypothesis, based on a conservative test
- p_value
a p-value based on the same test as test
- full_mod
the object returned by the estimation procedure for the full data regression (if applicable)
- red_mod
the object returned by the estimation procedure for the reduced data regression (if applicable)
- alpha
the level, for confidence interval calculation
- sample_splitting_folds
the folds used for hypothesis testing
- cross_fitting_folds
the folds used for cross-fitting
- y
the outcome
- ipc_weights
the weights
- cluster_id
the cluster IDs
- mat
a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value
Value
An object of classes vim and vim_rsquared.
See Details for more information.
See Also
SuperLearner for specific usage of the SuperLearner function and package.
Examples
# generate the data
# generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))
# apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2
# generate Y ~ Normal (smooth, 1)
y <- smooth + stats::rnorm(n, 0, 1)
# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
learners <- c("SL.glm", "SL.mean")
# estimate (with a small number of folds, for illustration only)
est <- vimp_rsquared(y, x, indx = 2,
                     alpha = 0.05, run_regression = TRUE,
                     SL.library = learners, V = 2, cvControl = list(V = 2))