R: Random Forest Permutation Importance for random forests

permimp {permimp}

R Documentation

Random Forest Permutation Importance for random forests

Description

Standard and partial/conditional permutation importance for random forest objects fit using the party or randomForest packages, following the permutation principle of the 'mean decrease in accuracy' importance in randomForest. The partial/conditional permutation importance is implemented differently from the original varimp function in party: the predictors to condition on are selected in each tree using Pearson Chi-squared tests applied to the predictors after they have been categorized by the tree's split points. In general, the new implementation gives results similar to those of the original varimp function. With asParty = TRUE, the partial/conditional permutation importance is fully backward-compatible with, but faster than, the original varimp function in party.

Usage

 
permimp(object, ...)
## S3 method for class 'randomForest'
permimp(object, nperm = 1, OOB = TRUE, scaled = FALSE,
     conditional = FALSE, threshold = .95, whichxnames = NULL,   
     thresholdDiagnostics = FALSE, progressBar = TRUE,  do_check = TRUE, ...)
## S3 method for class 'RandomForest'
permimp(object, nperm = 1, OOB = TRUE, scaled = FALSE,
     conditional = FALSE, threshold = .95, whichxnames = NULL,   
     thresholdDiagnostics = FALSE, progressBar = TRUE, 
     pre1.0_0 = conditional, AUC = FALSE, asParty = FALSE, mincriterion = 0, ...)

Arguments

`object`	an object as returned by `cforest` or `randomForest`.
`mincriterion`	the value of the test statistic or 1 - p-value that must be exceeded in order to include a split in the computation of the importance. The default `mincriterion = 0` guarantees that all splits are included.
`conditional`	a logical that determines whether unconditional or conditional permutation is performed.
`threshold`	the threshold value for (1 - p-value) of the association between the predictor of interest and another predictor, which must be exceeded in order to include the other predictor in the conditioning scheme for the predictor of interest (only relevant if `conditional = TRUE`). A threshold value of zero includes all other predictors.
`nperm`	the number of permutations performed.
`OOB`	a logical that determines whether the importance is computed from the out-of-bag sample or the learning sample (the latter is not recommended).
`pre1.0_0`	Prior to party version 1.0-0, the actual data values were permuted according to the original permutation importance suggested by Breiman (2001). Now the assignments to child nodes of splits in the variable of interest are permuted as described by Hapfelmeier et al. (2012), which allows for missing values in the predictors and is more efficient with respect to memory consumption and computing time. This method does not apply to the conditional permutation importance, nor to random forests that were not fit using the party package.
`scaled`	a logical that determines whether the differences in prediction accuracy should be scaled by the total (null-model) error.
`AUC`	a logical that determines whether the Area Under the Curve (AUC) instead of the accuracy is used to compute the permutation importance (cf. Janitza et al., 2013). The AUC-based permutation importance is more robust to class imbalance, but it is only applicable to binary classification.
`asParty`	a logical that determines whether or not exactly the same values as the original `varimp` function in party should be obtained.
`whichxnames`	a character vector containing the predictor variable names for which the permutation importance should be computed. Only use when aware of the implications, see section 'Details'.
`thresholdDiagnostics`	a logical that specifies whether diagnostics with respect to the threshold value should be issued as warnings.
`progressBar`	a logical that determines whether a progress bar should be displayed.
`do_check`	a logical that determines whether a check requiring user input should be included.
`...`	additional arguments to be passed to the methods.
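The difference between accuracy and AUC under class imbalance, which motivates the `AUC` argument, can be illustrated with a small base-R sketch. The helper `auc_rank` is hypothetical (it is not part of permimp or party) and computes the AUC from ranks via the Mann-Whitney statistic; it is not claimed to match how the package computes the AUC internally.

```r
# Hypothetical helper, not part of permimp: AUC computed from midranks
auc_rank <- function(score, label) {
  r  <- rank(score)              # midranks, so ties are handled
  n1 <- sum(label == 1)          # number of positives
  n0 <- sum(label == 0)          # number of negatives
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Imbalanced toy data: 95 negatives, 5 positives
label <- c(rep(0, 95), rep(1, 5))

# A useless classifier that always predicts the majority class
pred_class <- rep(0, 100)
pred_score <- rep(0.05, 100)     # the same score for every case

accuracy <- mean(pred_class == label)    # 0.95
auc      <- auc_rank(pred_score, label)  # 0.5
```

The accuracy looks high only because of the imbalance, whereas the AUC stays at chance level. An importance measure based on accuracy can therefore be insensitive in such data.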

Details

The function permimp is closely comparable to varimp in party, but the partial/conditional variable importance has a different, more efficient implementation. Compared to the original varimp in party, permimp applies a different strategy to select the predictors to condition on (see Debeer and Strobl, 2020).

With asParty = TRUE, permimp returns exactly the same values as varimp in party, but the computation is done more efficiently.

If conditional = TRUE, the importance of each variable is computed by permuting within a grid defined by the predictors that are associated with the variable of interest (that is, with 1 - p-value greater than threshold). The threshold can be interpreted as a parameter that moves the permutation importance along a dimension from fully conditional (threshold = 0) to completely unconditional (threshold = 1), see Debeer and Strobl (2020).
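The idea of permuting within a grid can be sketched in base R. This is a conceptual illustration only and not the internal permimp code: the data, the split points in `cutpoints`, and the helper `permute_within` are all hypothetical. In a forest, the split points would come from the tree, and the threshold would decide which predictors enter the grid.

```r
# Sketch only: NOT the internal permimp code. All names are hypothetical.
set.seed(1)
n <- 200
z <- rnorm(n)                    # predictor to condition on
x <- z + rnorm(n, sd = 0.5)      # predictor of interest, correlated with z

# Hypothetical split points for z, defining the cells of the grid
cutpoints <- c(-0.5, 0.5)
grid <- cut(z, breaks = c(-Inf, cutpoints, Inf))

# Permute the values of v separately within each grid cell
permute_within <- function(v, cells) {
  ave(v, cells, FUN = function(u) u[sample.int(length(u))])
}

x_cond <- permute_within(x, grid)   # conditional permutation
x_marg <- x[sample.int(n)]          # unconditional permutation

# Compare how much of the x-z association survives each permutation
c(original      = cor(x, z),
  conditional   = cor(x_cond, z),
  unconditional = cor(x_marg, z))
```

Because values are only shuffled among cases that fall in the same cell of z, the conditional permutation preserves part of the association between x and z, while the unconditional permutation breaks it.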

Using the whichxnames argument, the computation of the permutation importance can be limited to a smaller number of specified predictors. Note, however, that when conditional = TRUE, the (other) predictors to condition on are also limited to this selection of predictors. Only use when fully aware of the implications.
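A minimal usage sketch of this restriction, assuming the party and permimp packages are installed (the block is skipped otherwise). It reuses the readingSkills data from the Examples section; the object names `cf`, `imp_all`, and `imp_sub` are chosen here for illustration.

```r
# Assumes the party and permimp packages are installed; skipped otherwise.
if (requireNamespace("party", quietly = TRUE) &&
    requireNamespace("permimp", quietly = TRUE)) {
  set.seed(290875)
  cf <- party::cforest(score ~ ., data = party::readingSkills,
                       control = party::cforest_unbiased(mtry = 2, ntree = 25))

  # All predictors: each one can be conditioned on any of the others
  set.seed(290875)
  imp_all <- permimp::permimp(cf, conditional = TRUE, progressBar = FALSE)

  # Restricted: only age and shoeSize are evaluated, and nativeSpeaker
  # can no longer be part of the conditioning grid
  set.seed(290875)
  imp_sub <- permimp::permimp(cf, conditional = TRUE,
                              whichxnames = c("age", "shoeSize"),
                              progressBar = FALSE)

  print(imp_all$values)
  print(imp_sub$values)
}
```

The values for age and shoeSize may differ between the two calls, because in the restricted call nativeSpeaker is excluded from the set of candidate conditioning predictors.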

For further details, please refer to the documentation of varimp.

Value

An object of class VarImp, with the mean decrease in accuracy as its $values.

References

Leo Breiman (2001). Random Forests. Machine Learning, 45(1), 5–32.

Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm, and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing, https://link.springer.com/article/10.1007/s11222-012-9349-1

Torsten Hothorn, Kurt Hornik, and Achim Zeileis (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674. Preprint available from https://www.zeileis.org/papers/Hothorn+Hornik+Zeileis-2006.pdf

Silke Janitza, Carolin Strobl, and Anne-Laure Boulesteix (2013). An AUC-based Permutation Variable Importance Measure for Random Forests. BMC Bioinformatics, 14, 119. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-119

Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307

Dries Debeer and Carolin Strobl (2020). Conditional Permutation Importance Revisited. BMC Bioinformatics, 21, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03622-2

Examples

  
  ### for RandomForest-objects, by party::cforest()  
  set.seed(290875)
  readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills, 
                              control = party::cforest_unbiased(mtry = 2, ntree = 25))
  
  ### conditional importance, may take a while...
  # party implementation:
  set.seed(290875)
  party::varimp(readingSkills.cf, conditional = TRUE)
  # faster implementation but same results
  set.seed(290875)
  permimp(readingSkills.cf, conditional = TRUE, asParty = TRUE)
  
  # different implementation with similar results
  set.seed(290875)
  permimp(readingSkills.cf, conditional = TRUE, asParty = FALSE)
  
  ### standard (unconditional) importance is unchanged
  set.seed(290875)
  party::varimp(readingSkills.cf)
  set.seed(290875)
  permimp(readingSkills.cf)
  
  
  ### for randomForest-objects, by randomForest::randomForest()
  set.seed(290875)
  readingSkills.rf <- randomForest::randomForest(score ~ ., data = party::readingSkills, 
                              mtry = 2, ntree = 25, importance = TRUE, 
                              keep.forest = TRUE, keep.inbag = TRUE)
                              
    
  ### (unconditional) Permutation Importance
  set.seed(290875)
  permimp(readingSkills.rf, do_check = FALSE)
  
  # very close to
  readingSkills.rf$importance[,1]
  
  ### Conditional Permutation Importance
  set.seed(290875)
  permimp(readingSkills.rf, conditional = TRUE, threshold = .8, do_check = FALSE)

[Package permimp version 1.0-2 Index]