R: Nonparametric Intrinsic Variable Importance Estimates and...

cv_vim {vimp}

R Documentation

Nonparametric Intrinsic Variable Importance Estimates and Inference using Cross-fitting

Description

Compute estimates and confidence intervals using cross-fitting for nonparametric intrinsic variable importance based on the population-level contrast between the oracle predictiveness using the feature(s) of interest versus not.

Usage

cv_vim(
  Y = NULL,
  X = NULL,
  cross_fitted_f1 = NULL,
  cross_fitted_f2 = NULL,
  f1 = NULL,
  f2 = NULL,
  indx = 1,
  V = ifelse(is.null(cross_fitting_folds), 5, length(unique(cross_fitting_folds))),
  sample_splitting = TRUE,
  final_point_estimate = "split",
  sample_splitting_folds = NULL,
  cross_fitting_folds = NULL,
  stratified = FALSE,
  type = "r_squared",
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  scale = "identity",
  na.rm = FALSE,
  C = rep(1, length(Y)),
  Z = NULL,
  ipc_scale = "identity",
  ipc_weights = rep(1, length(Y)),
  ipc_est_type = "aipw",
  scale_est = TRUE,
  nuisance_estimators_full = NULL,
  nuisance_estimators_reduced = NULL,
  exposure_name = NULL,
  cross_fitted_se = TRUE,
  bootstrap = FALSE,
  b = 1000,
  boot_interval_type = "perc",
  clustered = FALSE,
  cluster_id = rep(NA, length(Y)),
  ...
)

Arguments

`Y`	the outcome.
`X`	the covariates. If `type = "average_value"`, then the exposure variable should be part of `X`, with its name provided in `exposure_name`.
`cross_fitted_f1`	the predicted values on validation data from a flexible estimation technique regressing Y on X in the training data. Provided as either (a) a vector, where each element is the predicted value when that observation is part of the validation fold; or (b) a list of length V, where each element in the list is a set of predictions on the corresponding validation data fold. If sample-splitting is requested, then these must be estimated specially; see Details. However, the resulting vector should be the same length as `Y`; if using a list, then the summed length of each element across the list should be the same length as `Y` (i.e., each observation is included in the predictions).
`cross_fitted_f2`	the predicted values on validation data from a flexible estimation technique regressing either (a) the fitted values in `cross_fitted_f1`, or (b) Y, on X withholding the columns in `indx`. Provided as either (a) a vector, where each element is the predicted value when that observation is part of the validation fold; or (b) a list of length V, where each element in the list is a set of predictions on the corresponding validation data fold. If sample-splitting is requested, then these must be estimated specially; see Details. However, the resulting vector should be the same length as `Y`; if using a list, then the summed length of each element across the list should be the same length as `Y` (i.e., each observation is included in the predictions).
`f1`	the fitted values from a flexible estimation technique regressing Y on X. If sample-splitting is requested, then these must be estimated specially; see Details. If `cross_fitted_se = TRUE`, then this argument is not used.
`f2`	the fitted values from a flexible estimation technique regressing either (a) `f1` or (b) Y on X withholding the columns in `indx`. If sample-splitting is requested, then these must be estimated specially; see Details. If `cross_fitted_se = TRUE`, then this argument is not used.
`indx`	the indices of the covariate(s) to calculate variable importance for; defaults to 1.
`V`	the number of folds for cross-fitting, defaults to 5. If `sample_splitting = TRUE`, then a special type of `V`-fold cross-fitting is done. See Details for a more detailed explanation.
`sample_splitting`	should we use sample-splitting to estimate the full and reduced predictiveness? Defaults to `TRUE`, since inferences made using `sample_splitting = FALSE` will be invalid for variables with truly zero importance.
`final_point_estimate`	if sample splitting is used, should the final point estimates be based on only the sample-split folds used for inference (`"split"`, the default), or should they instead be based on the full dataset (`"full"`) or the average across the point estimates from each sample split (`"average"`)? All three options result in valid point estimates – sample-splitting is only required for valid inference.
`sample_splitting_folds`	the folds used for sample-splitting; these identify the observations that should be used to evaluate predictiveness based on the full and reduced sets of covariates, respectively. Only used if `run_regression = FALSE`.
`cross_fitting_folds`	the folds for cross-fitting. Only used if `run_regression = FALSE`.
`stratified`	if run_regression = TRUE, then should the generated folds be stratified based on the outcome (helps to ensure class balance across cross-validation folds)
`type`	the type of importance to compute; defaults to `r_squared`, but other supported options are `auc`, `accuracy`, `deviance`, and `anova`.
`run_regression`	if outcome Y and covariates X are passed to `vimp_accuracy`, and `run_regression` is `TRUE`, then Super Learner will be used; otherwise, variable importance will be computed using the inputted fitted values.
`SL.library`	a character vector of learners to pass to `SuperLearner`, if `f1` and `f2` are Y and X, respectively. Defaults to `SL.glmnet`, `SL.xgboost`, and `SL.mean`.
`alpha`	the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval.
`delta`	the value of the `\delta`-null (i.e., testing if importance < `\delta`); defaults to 0.
`scale`	should CIs be computed on original ("identity") or another scale? (options are "log" and "logit")
`na.rm`	should we remove NAs in the outcome and fitted values in computation? (defaults to `FALSE`)
`C`	the indicator of coarsening (1 denotes observed, 0 denotes unobserved).
`Z`	either (i) NULL (the default, in which case the argument `C` above must be all ones), or (ii) a character vector specifying the variable(s) among Y and X that are thought to play a role in the coarsening mechanism. To specify the outcome, use `"Y"`; to specify covariates, use a character number corresponding to the desired position in X (e.g., `"1"`).
`ipc_scale`	what scale should the inverse probability weight correction be applied on (if any)? Defaults to "identity". (other options are "log" and "logit")
`ipc_weights`	weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]).
`ipc_est_type`	the type of procedure used for coarsened-at-random settings; options are "ipw" (for inverse probability weighting) or "aipw" (for augmented inverse probability weighting). Only used if `C` is not all equal to 1.
`scale_est`	should the point estimate be scaled to be greater than or equal to 0? Defaults to `TRUE`.
`nuisance_estimators_full`	(only used if `type = "average_value"`) a list of nuisance function estimators on the observed data (may be within a specified fold, for cross-fitted estimates). Specifically: an estimator of the optimal treatment rule; an estimator of the propensity score under the estimated optimal treatment rule; and an estimator of the outcome regression when treatment is assigned according to the estimated optimal rule.
`nuisance_estimators_reduced`	(only used if `type = "average_value"`) a list of nuisance function estimators on the observed data (may be within a specified fold, for cross-fitted estimates). Specifically: an estimator of the optimal treatment rule; an estimator of the propensity score under the estimated optimal treatment rule; and an estimator of the outcome regression when treatment is assigned according to the estimated optimal rule.
`exposure_name`	(only used if `type = "average_value"`) the name of the exposure of interest; binary, with 1 indicating presence of the exposure and 0 indicating absence of the exposure.
`cross_fitted_se`	should we use cross-fitting to estimate the standard errors (`TRUE`, the default) or not (`FALSE`)?
`bootstrap`	should bootstrap-based standard error estimates be computed? Defaults to `FALSE` (and currently may only be used if `sample_splitting = FALSE`).
`b`	the number of bootstrap replicates (only used if `bootstrap = TRUE` and `sample_splitting = FALSE`); defaults to 1000.
`boot_interval_type`	the type of bootstrap interval (one of `"norm"`, `"basic"`, `"stud"`, `"perc"`, or `"bca"`, as in `boot{boot.ci}`) if requested. Defaults to `"perc"`.
`clustered`	should the bootstrap resamples be performed on clusters rather than individual observations? Defaults to `FALSE`.
`cluster_id`	vector of the same length as `Y` giving the cluster IDs used for the clustered bootstrap, if `clustered` is `TRUE`.
`...`	other arguments to the estimation tool, see "See also".

Details

We define the population variable importance measure (VIM) for the group of features (or single feature) s with respect to the predictiveness measure V by

\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),

where f_0 is the population predictiveness maximizing function, f_{0,s} is the population predictiveness maximizing function that is only allowed to access the features with index not in s, and P_0 is the true data-generating distribution.

Cross-fitted VIM estimates are computed differently if sample-splitting is requested versus if it is not. We recommend using sample-splitting in most cases, since only in this case will inferences be valid if the variable(s) of interest have truly zero population importance. The purpose of cross-fitting is to estimate f_0 and f_{0,s} on independent data from estimating P_0; this can result in improved performance, especially when using flexible learning algorithms. The purpose of sample-splitting is to estimate f_0 and f_{0,s} on independent data; this allows valid inference under the null hypothesis of zero importance.

Without sample-splitting, cross-fitted VIM estimates are obtained by first splitting the data into K folds; then using each fold in turn as a hold-out set, constructing estimators f_{n,k} and f_{n,k,s} of f_0 and f_{0,s}, respectively on the training data and estimator P_{n,k} of P_0 using the test data; and finally, computing

\psi_{n,s} := K^{(-1)}\sum_{k=1}^K \{V(f_{n,k},P_{n,k}) - V(f_{n,k,s}, P_{n,k})\}.

With sample-splitting, cross-fitted VIM estimates are obtained by first splitting the data into 2K folds. These folds are further divided into 2 groups of folds. Then, for each fold k in the first group, estimator f_{n,k} of f_0 is constructed using all data besides the kth fold in the group (i.e., (2K - 1)/(2K) of the data) and estimator P_{n,k} of P_0 is constructed using the held-out data (i.e., 1/2K of the data); then, computing

v_{n,k} = V(f_{n,k},P_{n,k}).

Similarly, for each fold k in the second group, estimator f_{n,k,s} of f_{0,s} is constructed using all data besides the kth fold in the group (i.e., (2K - 1)/(2K) of the data) and estimator P_{n,k} of P_0 is constructed using the held-out data (i.e., 1/2K of the data); then, computing

v_{n,k,s} = V(f_{n,k,s},P_{n,k}).

Finally,

\psi_{n,s} := K^{(-1)}\sum_{k=1}^K \{v_{n,k} - v_{n,k,s}\}.

See the paper by Williamson, Gilbert, Simon, and Carone for more details on the mathematics behind the cv_vim function, and the validity of the confidence intervals.

In the interest of transparency, we return most of the calculations within the vim object. This results in a list including:

s: the column(s) to calculate variable importance for
SL.library: the library of learners passed to SuperLearner
full_fit: the fitted values of the chosen method fit to the full data (a list, for train and test data)
red_fit: the fitted values of the chosen method fit to the reduced data (a list, for train and test data)
est: the estimated variable importance
naive: the naive estimator of variable importance
eif: the estimated efficient influence function
eif_full: the estimated efficient influence function for the full regression
eif_reduced: the estimated efficient influence function for the reduced regression
se: the standard error for the estimated variable importance
ci: the (1-\alpha) \times 100% confidence interval for the variable importance estimate
test: a decision to either reject (TRUE) or not reject (FALSE) the null hypothesis, based on a conservative test
p_value: a p-value based on the same test as test
full_mod: the object returned by the estimation procedure for the full data regression (if applicable)
red_mod: the object returned by the estimation procedure for the reduced data regression (if applicable)
alpha: the level, for confidence interval calculation
sample_splitting_folds: the folds used for hypothesis testing
cross_fitting_folds: the folds used for cross-fitting
y: the outcome
ipc_weights: the weights
cluster_id: the cluster IDs
mat: a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value

Value

An object of class vim. See Details for more information.

Examples

n <- 100
p <- 2
# generate the data
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))

# apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2

# generate Y ~ Normal (smooth, 1)
y <- as.matrix(smooth + stats::rnorm(n, 0, 1))

# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
learners <- c("SL.glm")

# -----------------------------------------
# using Super Learner (with a small number of folds, for illustration only)
# -----------------------------------------
set.seed(4747)
est <- cv_vim(Y = y, X = x, indx = 2, V = 2,
type = "r_squared", run_regression = TRUE,
SL.library = learners, cvControl = list(V = 2), alpha = 0.05)

# ------------------------------------------
# doing things by hand, and plugging them in
# (with a small number of folds, for illustration only)
# ------------------------------------------
# set up the folds
indx <- 2
V <- 2
Y <- matrix(y)
set.seed(4747)
# Note that the CV.SuperLearner should be run with an outer layer
# of 2*V folds (for V-fold cross-fitted importance)
full_cv_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
Y = Y, X = x, SL.library = learners, cvControl = list(V = 2 * V),
innerCvControl = list(list(V = V))
))
full_cv_preds <- full_cv_fit$SL.predict
# use the same cross-fitting folds for reduced
reduced_cv_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
    Y = Y, X = x[, -indx, drop = FALSE], SL.library = learners,
    cvControl = SuperLearner::SuperLearner.CV.control(
        V = 2 * V, validRows = full_cv_fit$folds
    ),
    innerCvControl = list(list(V = V))
))
reduced_cv_preds <- reduced_cv_fit$SL.predict
# for hypothesis testing
cross_fitting_folds <- get_cv_sl_folds(full_cv_fit$folds)
set.seed(1234)
sample_splitting_folds <- make_folds(unique(cross_fitting_folds), V = 2)
set.seed(5678)
est <- cv_vim(Y = y, cross_fitted_f1 = full_cv_preds,
cross_fitted_f2 = reduced_cv_preds, indx = 2, delta = 0, V = V, type = "r_squared",
cross_fitting_folds = cross_fitting_folds,
sample_splitting_folds = sample_splitting_folds,
run_regression = FALSE, alpha = 0.05, na.rm = TRUE)

[Package vimp version 2.3.3 Index]