vim {vimp}    R Documentation
Nonparametric Intrinsic Variable Importance Estimates and Inference
Description
Compute estimates of and confidence intervals for nonparametric intrinsic variable importance based on the population-level contrast between the oracle predictiveness using the feature(s) of interest versus not.
Usage
vim(
Y = NULL,
X = NULL,
f1 = NULL,
f2 = NULL,
indx = 1,
type = "r_squared",
run_regression = TRUE,
SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
alpha = 0.05,
delta = 0,
scale = "identity",
na.rm = FALSE,
sample_splitting = TRUE,
sample_splitting_folds = NULL,
final_point_estimate = "split",
stratified = FALSE,
C = rep(1, length(Y)),
Z = NULL,
ipc_scale = "identity",
ipc_weights = rep(1, length(Y)),
ipc_est_type = "aipw",
scale_est = TRUE,
nuisance_estimators_full = NULL,
nuisance_estimators_reduced = NULL,
exposure_name = NULL,
bootstrap = FALSE,
b = 1000,
boot_interval_type = "perc",
clustered = FALSE,
cluster_id = rep(NA, length(Y)),
...
)
Arguments
Y
the outcome.

X
the covariates. If type = "average_value", then the exposure variable should be part of X, with its name provided in exposure_name.

f1
the fitted values from a flexible estimation technique regressing Y on X. A vector of the same length as Y; if sample splitting is desired, then the value of f1 at each position should be the result of predicting from a model trained without that observation.

f2
the fitted values from a flexible estimation technique regressing either (a) f1 or (b) Y on X withholding the columns in indx. A vector of the same length as Y; if sample splitting is desired, then the value of f2 at each position should be the result of predicting from a model trained without that observation.

indx
the indices of the covariate(s) to calculate variable importance for; defaults to 1.

type
the type of importance to compute; defaults to "r_squared", but other supported options are "auc", "accuracy", "deviance", "anova", and "average_value".

run_regression
if outcome Y and covariates X are passed to vim and run_regression is TRUE, then Super Learner will be used to estimate the required regression functions; otherwise, variable importance will be computed using the inputted fitted values.

SL.library
a character vector of learners to pass to SuperLearner if run_regression = TRUE. Defaults to SL.glmnet, SL.xgboost, and SL.mean.

alpha
the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval.

delta
the value of the delta-null (i.e., testing if importance < delta); defaults to 0. A concrete call illustrating delta and scale follows this list.

scale
should CIs be computed on the original ("identity") or another scale? (options are "log" and "logit")

na.rm
should we remove NAs in the outcome and fitted values in computation? (defaults to FALSE)

sample_splitting
should we use sample-splitting to estimate the full and reduced predictiveness? Defaults to TRUE, since inferences made using sample_splitting = FALSE will be invalid for variables with truly zero importance.

sample_splitting_folds
the folds used for sample-splitting; these identify the observations that should be used to evaluate predictiveness based on the full and reduced sets of covariates, respectively. Only used if run_regression = FALSE.

final_point_estimate
if sample splitting is used, should the final point estimates be based on only the sample-split folds used for inference ("split", the default), on the full dataset ("full"), or on the average of the point estimates from each sample split ("average")? All three options result in valid point estimates; sample splitting is only required for valid hypothesis testing and confidence intervals.

stratified
if run_regression = TRUE, then should the generated folds be stratified based on the outcome (helps to ensure class balance across cross-validation folds)?

C
the indicator of coarsening (1 denotes observed, 0 denotes unobserved).

Z
either (i) NULL (the default, in which case the argument C above must be all ones), or (ii) a character vector specifying the variable(s) among Y and X that are thought to play a role in the coarsening mechanism. To specify the outcome, use "Y"; to specify covariates, use a character number corresponding to the desired position in X (e.g., "1").

ipc_scale
what scale should the inverse probability weight correction be applied on (if any)? Defaults to "identity". (other options are "log" and "logit")

ipc_weights
weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]).

ipc_est_type
the type of procedure used for coarsened-at-random settings; options are "ipw" (for inverse probability weighting) or "aipw" (for augmented inverse probability weighting). Only used if C is not all equal to 1.

scale_est
should the point estimate be scaled to be greater than or equal to 0? Defaults to TRUE.

nuisance_estimators_full
(only used if type = "average_value") a list of nuisance function estimators on the observed data (may be within a specified fold, for cross-fitted estimates): an estimator of the optimal treatment rule, an estimator of the propensity score under the estimated optimal treatment rule, and an estimator of the outcome regression when treatment is assigned according to the estimated optimal rule.

nuisance_estimators_reduced
(only used if type = "average_value") the same list of nuisance function estimators as above, but based on the reduced set of covariates.

exposure_name
(only used if type = "average_value") the name of the exposure of interest; binary, with 1 indicating presence of the exposure and 0 indicating absence of the exposure.

bootstrap
should bootstrap-based standard error estimates be computed? Defaults to FALSE.

b
the number of bootstrap replicates; defaults to 1000. Only used if bootstrap = TRUE and sample_splitting = FALSE.

boot_interval_type
the type of bootstrap interval (one of "norm", "basic", "stud", "perc", or "bca", as in boot.ci) if requested. Defaults to "perc".

clustered
should the bootstrap resamples be performed on clusters rather than individual observations? Defaults to FALSE.

cluster_id
vector of the same length as Y giving the cluster IDs used for the clustered bootstrap, if clustered = TRUE.

...
other arguments to the estimation tool; see "See Also".
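As referenced in the delta entry above, a concrete call can make the testing and scaling options easier to parse. The following is a minimal sketch, not one of the package's documented examples; the simulated data and the SL.glm-only library are assumptions chosen purely for speed.

# illustrative only: test H0: importance <= 0.01, with logit-scale CIs
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -1, 1)))
y <- rbinom(n, size = 1, prob = 0.5 + 0.3 * x[, 1] + 0.2 * x[, 2])
est_delta <- vim(Y = y, X = x, indx = 2, type = "r_squared",
                 run_regression = TRUE, SL.library = "SL.glm",
                 cvControl = list(V = 2),
                 delta = 0.01,    # null hypothesis: importance <= 0.01
                 scale = "logit") # CIs computed on the logit scale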
Details
We define the population variable importance measure (VIM) for the group of features (or single feature) $s$ with respect to the predictiveness measure $V$ by

$$\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),$$

where $f_0$ is the population predictiveness maximizing function, $f_{0,s}$ is the population predictiveness maximizing function that is only allowed to access the features with index not in $s$, and $P_0$ is the true data-generating distribution. VIM estimates are constructed by obtaining estimators $f_n$ and $f_{n,s}$ of $f_0$ and $f_{0,s}$, respectively; obtaining an estimator $P_n$ of $P_0$; and finally, setting $\psi_{n,s} := V(f_n, P_n) - V(f_{n,s}, P_n)$.
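To make the plug-in construction concrete, here is a minimal sketch (not part of the package) that computes the R-squared-based VIM estimate by hand. The simple linear-model fits stand in for the flexible estimators $f_n$ and $f_{n,s}$; vim() itself additionally performs sample-splitting and influence-function-based inference.

set.seed(1)
n <- 500
x1 <- stats::runif(n, -1, 1)
x2 <- stats::runif(n, -1, 1)
y <- 1 + 0.5 * x1 + 0.25 * x2 + stats::rnorm(n)
f_full <- fitted(stats::lm(y ~ x1 + x2)) # estimator f_n of f_0
f_red <- fitted(stats::lm(y ~ x1))       # estimator f_{n,s} of f_{0,s}, s = {2}
# R-squared predictiveness: V(f, P) = 1 - E[(Y - f(X))^2] / Var(Y)
V_rsq <- function(f, y) 1 - mean((y - f)^2) / stats::var(y)
psi_hat <- V_rsq(f_full, y) - V_rsq(f_red, y) # plug-in VIM estimate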
In the interest of transparency, we return most of the calculations
within the vim
object. This results in a list including:
- s
the column(s) to calculate variable importance for
- SL.library
the library of learners passed to
SuperLearner
- type
the type of risk-based variable importance measured
- full_fit
the fitted values of the chosen method fit to the full data
- red_fit
the fitted values of the chosen method fit to the reduced data
- est
the estimated variable importance
- naive
the naive estimator of variable importance (only used if type = "anova")
- eif
the estimated efficient influence function
- eif_full
the estimated efficient influence function for the full regression
- eif_reduced
the estimated efficient influence function for the reduced regression
- se
the standard error for the estimated variable importance
- ci
the (1 - alpha) x 100% confidence interval for the variable importance estimate
- test
a decision to either reject (TRUE) or not reject (FALSE) the null hypothesis, based on a conservative test
- p_value
a p-value based on the same test as
test
- full_mod
the object returned by the estimation procedure for the full data regression (if applicable)
- red_mod
the object returned by the estimation procedure for the reduced data regression (if applicable)
- alpha
the level, for confidence interval calculation
- sample_splitting_folds
the folds used for sample-splitting (used for hypothesis testing)
- y
the outcome
- ipc_weights
the weights
- cluster_id
the cluster IDs
- mat
a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value
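For instance, using the est_1 object created in the Examples section below, the key components can be extracted directly:

est_1$est     # point estimate of the importance of the second feature
est_1$se      # influence-function-based standard error
est_1$ci      # (1 - alpha) x 100% confidence interval
est_1$p_value # p-value from the conservative hypothesis test
est_1$mat     # tibble collecting the estimate, SE, CI, test, and p-value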
Value
An object of classes vim
and the type of risk-based measure.
See Details for more information.
See Also
SuperLearner
for specific usage of the
SuperLearner
function and package.
Examples
# generate the data
# generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -1, 1)))
# apply the function to the x's
f <- function(x) 0.5 + 0.3*x[1] + 0.2*x[2]
smooth <- apply(x, 1, function(z) f(z))
# generate Y ~ Bernoulli (smooth)
y <- matrix(rbinom(n, size = 1, prob = smooth))
# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
learners <- c("SL.glm")
# using Y and X; use class-balanced folds
est_1 <- vim(y, x, indx = 2, type = "accuracy",
             alpha = 0.05, run_regression = TRUE,
             SL.library = learners, cvControl = list(V = 2),
             stratified = TRUE)

# using pre-computed fitted values
set.seed(4747)
V <- 2
full_fit <- SuperLearner::CV.SuperLearner(Y = y, X = x,
                                          SL.library = learners,
                                          cvControl = list(V = 2),
                                          innerCvControl = list(list(V = V)))
full_fitted <- SuperLearner::predict.SuperLearner(full_fit)$pred
# fit the data with only X1
reduced_fit <- SuperLearner::CV.SuperLearner(Y = full_fitted,
                                             X = x[, -2, drop = FALSE],
                                             SL.library = learners,
                                             cvControl = list(V = 2, validRows = full_fit$folds),
                                             innerCvControl = list(list(V = V)))
reduced_fitted <- SuperLearner::predict.SuperLearner(reduced_fit)$pred
est_2 <- vim(Y = y, f1 = full_fitted, f2 = reduced_fitted,
             indx = 2, run_regression = FALSE, alpha = 0.05,
             stratified = TRUE, type = "accuracy",
             sample_splitting_folds = get_cv_sl_folds(full_fit$folds))
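The documented examples above do not exercise the bootstrap options. A hypothetical sketch, reusing y, x, and learners from above, might look as follows; the cluster identifier id is an assumption made purely for illustration (50 clusters of size 2), and sample_splitting = FALSE is set because the bootstrap standard errors are only available without sample-splitting.

# hypothetical: bootstrap-based SEs with cluster-level resampling
id <- rep(seq_len(n / 2), each = 2) # illustrative cluster IDs
est_3 <- vim(y, x, indx = 2, type = "accuracy",
             run_regression = TRUE, SL.library = learners,
             cvControl = list(V = 2), sample_splitting = FALSE,
             bootstrap = TRUE, b = 100, # small b for speed
             boot_interval_type = "perc",
             clustered = TRUE, cluster_id = id)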