sp_vim {vimp} | R Documentation |
Shapley Population Variable Importance Measure (SPVIM) Estimates and Inference
Description
Compute estimates and confidence intervals for the SPVIMs, using cross-fitting.
Usage
sp_vim(
Y = NULL,
X = NULL,
V = 5,
type = "r_squared",
SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
univariate_SL.library = NULL,
gamma = 1,
alpha = 0.05,
delta = 0,
na.rm = FALSE,
stratified = FALSE,
verbose = FALSE,
sample_splitting = TRUE,
final_point_estimate = "split",
C = rep(1, length(Y)),
Z = NULL,
ipc_scale = "identity",
ipc_weights = rep(1, length(Y)),
ipc_est_type = "aipw",
scale = "identity",
scale_est = TRUE,
cross_fitted_se = TRUE,
...
)
Arguments
Y |
the outcome. |
X |
the covariates. If |
V |
the number of folds for cross-fitting, defaults to 5. If
|
type |
the type of importance to compute; defaults to
|
SL.library |
a character vector of learners to pass to
|
univariate_SL.library |
(optional) a character vector of learners to
pass to |
gamma |
the fraction of the sample size to use when sampling subsets
(e.g., |
alpha |
the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval. |
delta |
the value of the |
na.rm |
should we remove NAs in the outcome and fitted values
in computation? (defaults to |
stratified |
if run_regression = TRUE, then should the generated folds be stratified based on the outcome (helps to ensure class balance across cross-validation folds) |
verbose |
should |
sample_splitting |
should we use sample-splitting to estimate the full and
reduced predictiveness? Defaults to |
final_point_estimate |
if sample splitting is used, should the final point estimates
be based on only the sample-split folds used for inference ( |
C |
the indicator of coarsening (1 denotes observed, 0 denotes unobserved). |
Z |
either (i) NULL (the default, in which case the argument
|
ipc_scale |
what scale should the inverse probability weight correction be applied on (if any)? Defaults to "identity". (other options are "log" and "logit") |
ipc_weights |
weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]). |
ipc_est_type |
the type of procedure used for coarsened-at-random
settings; options are "ipw" (for inverse probability weighting) or
"aipw" (for augmented inverse probability weighting).
Only used if |
scale |
should CIs be computed on original ("identity") or another scale? (options are "log" and "logit") |
scale_est |
should the point estimate be scaled to be greater than or equal to 0?
Defaults to |
cross_fitted_se |
should we use cross-fitting to estimate the standard
errors ( |
... |
other arguments to the estimation tool, see "See also". |
Details
We define the SPVIM as the weighted average of the population
difference in predictiveness over all subsets of features not containing
feature j.
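As a sketch of the definition above, the SPVIM for feature j can be written in the standard Shapley form. The notation is an assumption made here for illustration and is not taken from this help page: p is the number of features, V(f_{0,s}, P_0) is the population predictiveness of the best prediction function using only the features in subset s, and the weights are the usual Shapley weights.

```latex
\psi_{0,j} \;=\; \sum_{s \subseteq \{1,\ldots,p\} \setminus \{j\}}
  \frac{1}{p} \binom{p-1}{|s|}^{-1}
  \left\{ V\!\left(f_{0,\, s \cup \{j\}}, P_0\right) - V\!\left(f_{0,\, s}, P_0\right) \right\}
```

Each term is the gain in predictiveness from adding feature j to subset s; see Williamson and Feng (2020) for the authoritative statement.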
This is equivalent to finding the solution to a population weighted least squares problem. This key fact allows us to estimate the SPVIM using weighted least squares, where we first sample subsets from the power set of all possible features using the Shapley sampling distribution; then use cross-fitting to obtain estimators of the predictiveness of each sampled subset; and finally, solve the least squares problem given in Williamson and Feng (2020).
See the paper by Williamson and Feng (2020) for more details on the mathematics behind this function, and the validity of the confidence intervals.
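To make the weighted-average idea concrete, the following is a minimal base-R sketch that computes Shapley values exactly by enumerating every subset for p = 3 features. It is not how sp_vim works internally (the function samples subsets and solves a weighted least squares problem, as described above). The predictiveness values in `v` and the helpers `pred` and `all_subsets` are hypothetical, made up for illustration.

```r
# Hypothetical predictiveness (e.g., R-squared) for every subset of p = 3
# features; "0" denotes the empty subset. These numbers are invented.
p <- 3
v <- c("0" = 0.00, "1" = 0.30, "2" = 0.20, "3" = 0.10,
       "12" = 0.45, "13" = 0.38, "23" = 0.28, "123" = 0.50)

# Look up the predictiveness of a subset given as a vector of feature indices
pred <- function(s) {
  key <- if (length(s) == 0) "0" else paste(sort(s), collapse = "")
  v[[key]]
}

# Enumerate all subsets of the indices in idx (including the empty subset)
all_subsets <- function(idx) {
  m <- length(idx)
  lapply(0:(2^m - 1), function(b) {
    # position k is included when bit k of b is set
    idx[(b %/% 2^(seq_len(m) - 1)) %% 2 == 1]
  })
}

# Shapley value of feature j: weighted average, over subsets s not containing
# j, of the gain in predictiveness from adding j to s
shapley <- sapply(seq_len(p), function(j) {
  others <- setdiff(seq_len(p), j)
  contribs <- sapply(all_subsets(others), function(s) {
    w <- 1 / (p * choose(p - 1, length(s)))
    w * (pred(c(s, j)) - pred(s))
  })
  sum(contribs)
})
names(shapley) <- paste0("feature_", seq_len(p))
print(shapley)
```

With exact enumeration, the values sum to the difference in predictiveness between the full feature set and the empty set, which is a useful sanity check. Enumeration requires 2^p subsets, which is why sampling subsets becomes necessary for larger p.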
In the interest of transparency, we return most of the calculations
within the vim object. This results in a list containing:
- SL.library
the library of learners passed to SuperLearner
- v
the estimated predictiveness measure for each sampled subset
- fit_lst
the fitted values on the entire dataset from the chosen method for each sampled subset
- preds_lst
the cross-fitted predicted values from the chosen method for each sampled subset
- est
the estimated SPVIM value for each feature
- ics
the influence functions for each sampled subset
- var_v_contribs
the contributions to the variance from estimating predictiveness
- var_s_contribs
the contributions to the variance from sampling subsets
- ic_lst
a list of the SPVIM influence function contributions
- se
the standard errors for the estimated variable importance
- ci
the $(1-\alpha) \times 100$% confidence intervals based on the variable importance estimates
- p_value
p-values for the null hypothesis test of zero importance for each variable
- test_statistic
the test statistic for each null hypothesis test of zero importance
- test
a hypothesis testing decision for each null hypothesis test (for each variable having zero importance)
- gamma
the fraction of the sample size used when sampling subsets
- alpha
the level, for confidence interval calculation
- delta
the delta value used for hypothesis testing
- y
the outcome
- ipc_weights
the weights
- scale
the scale on which CIs were computed
- mat
a tibble with the estimates, SEs, CIs, hypothesis testing decisions, and p-values
Value
An object of class vim. See Details for more information.
See Also
SuperLearner for specific usage of the SuperLearner function and package.
Examples
n <- 100
p <- 2
# generate the data
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))
# apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2
# generate Y ~ Normal (smooth, 1)
y <- as.matrix(smooth + stats::rnorm(n, 0, 1))
# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
learners <- c("SL.glm")
# -----------------------------------------
# using Super Learner (with a small number of CV folds,
# for illustration only)
# -----------------------------------------
set.seed(4747)
est <- sp_vim(Y = y, X = x, V = 2, type = "r_squared",
              SL.library = learners, alpha = 0.05)
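Once the call returns, the components listed under Details can be read off the result. The sketch below repeats the example setup so that it is self-contained, then prints the summary tibble and a few individual components. It assumes the vimp and SuperLearner packages are installed and skips the fit otherwise; the component names (`mat`, `est`, `se`, `ci`) are those listed in Details.

```r
# Simulate data as in the example above
n <- 100
p <- 2
set.seed(4747)
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))
smooth <- (x[, 1] / 5)^2 * (x[, 1] + 7) / 5 + (x[, 2] / 3)^2
y <- as.matrix(smooth + stats::rnorm(n, 0, 1))

# Only run the fit if both packages are available (an assumption of this sketch)
if (requireNamespace("vimp", quietly = TRUE) &&
    requireNamespace("SuperLearner", quietly = TRUE)) {
  library("SuperLearner")
  est <- vimp::sp_vim(Y = y, X = x, V = 2, type = "r_squared",
                      SL.library = "SL.glm", alpha = 0.05)
  print(est$mat)  # tibble with estimates, SEs, CIs, test decisions, p-values
  print(est$est)  # estimated SPVIM values
  print(est$se)   # standard errors
  print(est$ci)   # confidence intervals
}
```

Starting from `est$mat` is usually the most convenient, since it collects the point estimates and inferential summaries in one table.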