vi_permute {vip} | R Documentation |
Permutation-based variable importance
Compute permutation-based variable importance scores for the predictors in a model; for details on the algorithm, see Greenwell and Boehmke (2020).
vi_permute(object, ...)
## Default S3 method:
vi_permute(
  object,
  feature_names = NULL,
  train = NULL,
  target = NULL,
  metric = NULL,
  smaller_is_better = NULL,
  type = c("difference", "ratio"),
  nsim = 1,
  keep = TRUE,
  sample_size = NULL,
  sample_frac = NULL,
  reference_class = NULL,
  event_level = NULL,
  pred_wrapper = NULL,
  verbose = FALSE,
  parallel = FALSE,
  parallelize_by = c("features", "repetitions"),
  ...
)
object |
A fitted model object (e.g., a randomForest object). |
... |
Additional optional arguments to be passed on to
foreach (e.g., |
feature_names |
Character string giving the names of the predictor
variables (i.e., features) of interest. If |
train |
A matrix-like R object (e.g., a data frame or matrix)
containing the training data. If |
target |
Either a character string giving the name (or position) of the
target column in |
metric |
Either a function or character string specifying the
performance metric to use in computing model performance (e.g., RMSE for
regression or accuracy for binary classification). If |
smaller_is_better |
Logical indicating whether or not a smaller value
of |
type |
Character string specifying how to compare the baseline and
permuted performance metrics. Current options are "difference" and "ratio". |
nsim |
Integer specifying the number of Monte Carlo replications to
perform. Default is 1. If |
keep |
Logical indicating whether or not to keep the individual
permutation scores for all |
sample_size |
Integer specifying the size of the random sample to use
for each Monte Carlo repetition. Default is NULL. |
sample_frac |
Proportion specifying the size of the random sample to use
for each Monte Carlo repetition. Default is NULL. |
reference_class |
Deprecated, use |
event_level |
String specifying which factor level of |
pred_wrapper |
Prediction function that requires two arguments,
verbose |
Logical indicating whether or not to print information during
the construction of variable importance scores. Default is FALSE. |
parallel |
Logical indicating whether or not to run |
parallelize_by |
Character string specifying whether to parallelize
across features ( |
A tidy data frame (i.e., a tibble object) with two columns:
- Variable - the corresponding feature name;
- Importance - the associated importance, computed as the average change in performance after a random permutation (or permutations, if nsim > 1) of the feature in question.
If nsim > 1, then an additional column (StDev) containing the standard
deviation of the individual permutation scores for each feature is also
returned; this helps assess the stability/variation of the individual
permutation importance for each feature.
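The computation behind these columns can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: it assumes a linear model fit to the built-in mtcars data, RMSE as the metric (so smaller is better), and the hypothetical helper names rmse() and perm_importance().

```r
# Minimal sketch of permutation importance (illustrative; not vip's own code)
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Hypothetical helper: permute each feature `nsim` times and compare the
# permuted metric to the baseline metric
perm_importance <- function(fit, data, target, nsim = 10,
                            type = c("difference", "ratio")) {
  type <- match.arg(type)
  features <- setdiff(names(data), target)
  baseline <- rmse(data[[target]], predict(fit, newdata = data))
  # One row per repetition, one column per feature
  scores <- matrix(NA_real_, nrow = nsim, ncol = length(features),
                   dimnames = list(NULL, features))
  for (f in features) {
    for (i in seq_len(nsim)) {
      shuffled <- data
      shuffled[[f]] <- sample(shuffled[[f]])  # break the feature-target link
      permuted <- rmse(shuffled[[target]], predict(fit, newdata = shuffled))
      # Smaller RMSE is better, so importance is the increase in error
      scores[i, f] <- if (type == "difference") {
        permuted - baseline
      } else {
        permuted / baseline
      }
    }
  }
  res <- data.frame(
    Variable = features,
    Importance = colMeans(scores),   # average over repetitions
    StDev = apply(scores, 2, sd),    # spread over repetitions (NA if nsim = 1)
    row.names = NULL
  )
  res[order(res$Importance, decreasing = TRUE), ]
}

dat <- mtcars[, c("mpg", "wt", "hp", "qsec")]
fit <- lm(mpg ~ wt + hp + qsec, data = dat)
set.seed(101)  # for reproducibility
vis_sketch <- perm_importance(fit, dat, target = "mpg", nsim = 10)
print(vis_sketch)
```

For a metric where larger is better (e.g., R-squared), the comparison would presumably be reversed (baseline minus permuted, or baseline over permuted) so that important features still receive positive scores.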
Brandon M. Greenwell and Bradley C. Boehmke, The R Journal (2020) 12:1, pages 343-366.
## Not run:
# Regression example
library(ranger) # for fitting random forests
# Simulate data from Friedman 1 benchmark; only x1-x5 are important!
trn <- gen_friedman(500, seed = 101) # ?vip::gen_friedman
# Prediction wrapper
pfun <- function(object, newdata) {
  # Needs to return vector of predictions from a ranger object; see
  # `ranger::predict.ranger` for details on making predictions
  predict(object, data = newdata)$predictions
}
# Fit a (default) random forest
set.seed(0803) # for reproducibility
rfo <- ranger(y ~ ., data = trn)
# Compute permutation-based VI scores
set.seed(2021) # for reproducibility
vis <- vi(rfo, method = "permute", target = "y", metric = "rsq",
pred_wrapper = pfun, train = trn)
# Same as above, but using `vi_permute()` directly
set.seed(2021) # for reproducibility
vi_permute(rfo, target = "y", metric = "rsq", pred_wrapper = pfun,
           train = trn)
# Plot VI scores (could also replace `vi()` with `vip()` in above example)
vip(vis, include_type = TRUE)
# Mean absolute error
mae <- function(truth, estimate) {
  mean(abs(truth - estimate))
}
# Permutation-based VIP with user-defined MAE metric
set.seed(1101) # for reproducibility
vi_permute(rfo, target = "y", metric = mae, smaller_is_better = TRUE,
pred_wrapper = pfun, train = trn)
# Same as above, but using `yardstick` package instead of user-defined metric
set.seed(1101) # for reproducibility
vi_permute(rfo, target = "y", metric = yardstick::mae_vec,
smaller_is_better = TRUE, pred_wrapper = pfun, train = trn)
# Classification (binary) example
library(randomForest) # another package for fitting random forests
# Complete (i.e., imputed version of titanic data); see `?vip::titanic_mice`
head(t1 <- titanic_mice[[1L]])
t1$pclass <- as.ordered(t1$pclass) # makes more sense as an ordered factor
# Fit another (default) random forest
set.seed(2053) # for reproducibility
(rfo2 <- randomForest(survived ~ ., data = t1))
# Define prediction wrapper for predicting class labels from a
# "randomForest" object
pfun_class <- function(object, newdata) {
  # Needs to return factor of classifications
  predict(object, newdata = newdata, type = "response")
}
# Sanity check
pfun_class(rfo2, newdata = head(t1))
##   1   2   3   4   5   6
## yes yes yes  no yes  no
## Levels: no yes
# Compute mean decrease in accuracy
set.seed(1359) # for reproducibility
vi(rfo2,
   method = "permute",
   train = t1,
   target = "survived",
   metric = "accuracy",  # or pass in `yardstick::accuracy_vec` directly
   # smaller_is_better = FALSE,  # no need to set for built-in metrics
   pred_wrapper = pfun_class,
   nsim = 30  # use 30 repetitions
)
## # A tibble: 5 × 3
##   Variable Importance   StDev
##   <chr>         <dbl>   <dbl>
## 1 sex          0.228  0.0110
## 2 pclass       0.0825 0.00505
## 3 age          0.0721 0.00557
## 4 sibsp        0.0346 0.00430
## 5 parch        0.0183 0.00236
# Define prediction wrapper for predicting class probabilities from a
# "randomForest" object
pfun_prob <- function(object, newdata) {
  # Needs to return vector of class probabilities for event level of interest
  predict(object, newdata = newdata, type = "prob")[, "yes"]
}
# Sanity check
pfun_prob(rfo2, newdata = head(t1))  # estimated P(survived=yes | x)
##     1     2     3     4     5     6
## 0.990 0.864 0.486 0.282 0.630 0.078
# Compute mean increase in Brier score
set.seed(1411) # for reproducibility
vi(rfo2,
   method = "permute",
   train = t1,
   target = "survived",
   metric = yardstick::brier_class_vec,  # or pass in `"brier"` directly
   smaller_is_better = FALSE,  # need to set when supplying a function
   pred_wrapper = pfun_prob,
   nsim = 30  # use 30 repetitions
)
## # A tibble: 5 × 3
##   Variable Importance   StDev
##   <chr>         <dbl>   <dbl>
## 1 sex          0.210  0.00869
## 2 pclass       0.0992 0.00462
## 3 age          0.0970 0.00469
## 4 parch        0.0547 0.00273
## 5 sibsp        0.0422 0.00200
# Some metrics, like AUROC, treat one class as the "event" of interest. In
# such cases, it's important to make sure the event level (which typically
# defaults to whichever class comes first in alphabetical order) matches the
# event class that corresponds to the prediction wrapper's returned
# probabilities. To do this, you can (and should) set the `event_level`
# argument. For instance, our prediction wrapper specified
# `survived = "yes"` as the event of interest, but this is considered the
# second event level:
levels(t1$survived)
## [1] "no"  "yes"
# So, we need to specify the second class as the event of interest via the
# `event_level` argument (otherwise, we would get the negative of the results
# we were hoping for; a telltale sign that the event level and prediction
# wrapper do not match)
set.seed(1413) # for reproducibility
vi(rfo2,
   method = "permute",
   train = t1,
   target = "survived",
   metric = "roc_auc",
   event_level = "second",  # use "yes" as class label/"event" of interest
   pred_wrapper = pfun_prob,
   nsim = 30  # use 30 repetitions
)
## # A tibble: 5 × 3
##   Variable Importance   StDev
##   <chr>         <dbl>   <dbl>
## 1 sex          0.229  0.0137
## 2 pclass       0.0920 0.00533
## 3 age          0.0850 0.00477
## 4 sibsp        0.0283 0.00211
## 5 parch        0.0251 0.00351
## End(Not run)
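The event-level mismatch described in the comments above can be seen without fitting a model. The sketch below assumes a hypothetical helper auc() built on the rank-sum (Mann-Whitney) identity and uses made-up toy labels and probabilities; it is only an illustration and is not how vip or yardstick compute the metric.

```r
# Hypothetical helper: AUC as P(score of an event > score of a non-event),
# computed from ranks (average ranks handle ties)
auc <- function(truth, prob, event) {
  is_event <- truth == event
  n1 <- sum(is_event)
  n0 <- sum(!is_event)
  r <- rank(prob)
  (sum(r[is_event]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data (made up for illustration)
truth <- factor(c("no", "no", "yes", "yes", "no", "yes"))
p_yes <- c(0.10, 0.40, 0.80, 0.35, 0.20, 0.90)  # toy P(yes | x)

levels(truth)  # "no" is the first level, "yes" the second

auc_second <- auc(truth, p_yes, event = "yes")  # event matches probabilities
auc_first <- auc(truth, p_yes, event = "no")    # mismatched event level
c(matched = auc_second, mismatched = auc_first)
```

The two values sum to one, so a mismatched event level puts the baseline AUC below 0.5. Permuting a useful feature would then tend to move the AUC back toward 0.5, which is why the resulting importance scores would be expected to come out negative.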