orsf_vi {aorsf}  R Documentation 
Estimate the importance of individual variables using oblique random survival forests.
orsf_vi(object, group_factors = TRUE, importance = NULL, oobag_fun = NULL, ...)
orsf_vi_negate(object, group_factors = TRUE, oobag_fun = NULL, ...)
orsf_vi_permute(object, group_factors = TRUE, oobag_fun = NULL, ...)
orsf_vi_anova(object, group_factors = TRUE, ...)
object 
(orsf_fit) a trained oblique random survival forest (see orsf). 
group_factors 
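(logical) if TRUE (the default), importance values for the individual levels of a factor are aggregated into one overall value per factor; if FALSE, a separate importance value is returned for each factor level.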
importance 
(character) Indicate method for variable importance: 'anova', 'negate', or 'permute'.
oobag_fun 
(function) to be used for evaluating out-of-bag prediction accuracy after negating coefficients (if importance = 'negate') or permuting the values of a predictor (if importance = 'permute').
For more details, see the out-of-bag vignette.
... 
Further arguments passed to or from other methods (not currently used). 
When an orsf_fit object is fitted with importance = 'anova', 'negate', or 'permute', the output will have a vector of importance values based on the requested type of importance. However, you may still want to call orsf_vi() on this output if you want to group factor levels into one overall importance value.
orsf_vi() is a general purpose function to extract or compute variable importance estimates from an 'orsf_fit' object (see orsf). orsf_vi_negate(), orsf_vi_permute(), and orsf_vi_anova() are wrappers for orsf_vi(). The way these functions work depends on whether the object they are given already has variable importance estimates in it or not (see examples).
orsf_vi functions return a named numeric vector.
Names of the vector are the predictor variables used by object.
Values of the vector are the estimated importance of the given predictor.
The returned vector is sorted from highest to lowest value, with higher values indicating higher importance.
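As a small illustration of working with this return value, the sketch below builds a named numeric vector by hand and pulls out the top-ranked predictors. The object names (vi, top3, vi_keep) and the 0.2 cutoff are hypothetical; the values are rounded from the grouped ANOVA example further down this page, and in practice vi would come from orsf_vi().

```r
# Hypothetical stand-in for the result of orsf_vi(); values are rounded
# from the grouped ANOVA importance example on this page.
vi <- c(ascites = 0.360, bili = 0.281, edema = 0.254,
        copper = 0.198, age = 0.186)

# The vector is sorted from highest to lowest, so the first names
# are the top-ranked predictors.
top3 <- names(vi)[1:3]

# Keep only predictors whose importance exceeds an arbitrary cutoff.
vi_keep <- vi[vi > 0.2]
```

Because the result is an ordinary named vector, the usual base R tools (head(), names(), logical subsetting) apply to it directly.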
negation importance: Each variable is assessed separately by multiplying the variable's coefficients by -1 and then determining how much the model's performance changes. The worse the model's performance after negating coefficients for a given variable, the more important the variable. This technique is promising because it does not require permutation and it emphasizes variables with larger coefficients in linear combinations, but it is also relatively new and hasn't been studied as much as permutation importance. See Jaeger, 2022 for more details on this technique.
permutation importance: Each variable is assessed separately by randomly permuting the variable's values and then determining how much the model's performance changes. The worse the model's performance after permuting the values of a given variable, the more important the variable. This technique is flexible, intuitive, and frequently used. It also has several known limitations.
analysis of variance (ANOVA) importance: A p-value is computed for each coefficient in each linear combination of variables in each decision tree. Importance for an individual predictor variable is the proportion of times a p-value for its coefficient is < 0.01. This technique is very efficient computationally, but may not be as effective as permutation or negation in terms of selecting signal over noise variables. See Menze, 2011 for more details on this technique.
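The three ideas above can be sketched on a toy problem in base R. This is not how aorsf computes importance internally. It is a minimal illustration under stated assumptions: an ordinary linear model stands in for the linear combinations inside a forest, correlation with the outcome stands in for the out-of-bag accuracy measure, and bootstrap refits stand in for the many nodes of many trees. All object names are hypothetical.

```r
set.seed(1)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 + rnorm(n)   # only x1 carries signal

fit  <- lm(y ~ x1 + x2, data = dat)
beta <- coef(fit)[c("x1", "x2")]
x    <- as.matrix(dat[, c("x1", "x2")])

# Performance of a linear combination: correlation with the outcome
# (an assumed stand-in for the accuracy measure of a survival model).
perf <- function(b, xmat) cor(drop(xmat %*% b), dat$y)
baseline <- perf(beta, x)

# Negation: multiply one coefficient by -1, leave the data alone.
vi_negate <- sapply(names(beta), function(v) {
  b <- beta
  b[v] <- -b[v]
  baseline - perf(b, x)
})

# Permutation: shuffle one predictor's values, leave coefficients alone.
vi_permute <- sapply(names(beta), function(v) {
  xp <- x
  xp[, v] <- sample(xp[, v])
  baseline - perf(beta, xp)
})

# ANOVA-style: refit on bootstrap samples (stand-ins for tree nodes) and
# record how often each coefficient's p-value falls below 0.01.
pvals <- replicate(100, {
  boot  <- dat[sample(n, replace = TRUE), ]
  coefs <- summary(lm(y ~ x1 + x2, data = boot))$coefficients
  coefs[c("x1", "x2"), "Pr(>|t|)"]
})
vi_anova <- rowMeans(pvals < 0.01)

round(rbind(negate = vi_negate, permute = vi_permute, anova = vi_anova), 3)
```

In this toy setting all three measures should rank x1 well above x2. Note that the ANOVA-style measure needs no second pass over the data after fitting, which mirrors why it is described above as the computationally cheapest option.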
The default variable importance technique, ANOVA, is calculated while you fit an ORSF ensemble.
fit <- orsf(pbc_orsf, Surv(time, status) ~ . - id)
fit
## Oblique random survival forest
##
## Linear combinations: Accelerated
## N observations: 276
## N events: 111
## N trees: 500
## N predictors total: 17
## N predictors per node: 5
## Average leaves per tree: 24
## Min observations in leaf: 5
## Min events in leaf: 1
## OOB stat value: 0.84
## OOB stat type: Harrell's C-statistic
## Variable importance: anova
ANOVA is the default because it is fast, but it may not be as decisive as the permutation and negation techniques for variable selection.
The ‘raw’ variable importance values are stored in the fit object:
fit$importance
##     edema_1   ascites_1        bili      copper         age     albumin
##  0.40000000  0.36013072  0.28140066  0.19806763  0.18608329  0.17480916
##   edema_0.5     protime        chol       stage   spiders_1         ast
##  0.15478339  0.15170120  0.14917270  0.13967665  0.13223663  0.11937944
##    hepato_1       sex_f        trig    alk.phos    platelet trt_placebo
##  0.11901034  0.10338744  0.09902514  0.09348659  0.07993689  0.06591549
These are ‘raw’ because values for factors have not been aggregated into a single value. Currently there is one value for k-1 levels of a k level factor. For example, you can see edema_1 and edema_0.5 in the importance values above because edema is a factor variable with levels of 0, 0.5, and 1.
To get aggregated values across all levels of each factor, use orsf_vi() with group_factors set to TRUE (the default):
orsf_vi(fit)
##    ascites       bili      edema     copper        age    albumin    protime
## 0.36013072 0.28140066 0.25403773 0.19806763 0.18608329 0.17480916 0.15170120
##       chol      stage    spiders        ast     hepato        sex       trig
## 0.14917270 0.13967665 0.13223663 0.11937944 0.11901034 0.10338744 0.09902514
##   alk.phos   platelet        trt
## 0.09348659 0.07993689 0.06591549
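To see what grouping changes, the grouped and level-specific results can be compared side by side. The sketch below assumes the aorsf package is installed and uses only the group_factors argument documented on this page; the seed and the object names vi_grouped and vi_levels are arbitrary choices for illustration, and exact importance values will vary from run to run.

```r
library(aorsf)

set.seed(329)
fit <- orsf(pbc_orsf, Surv(time, status) ~ . - id)

# One value per predictor (factor levels aggregated; the default).
vi_grouped <- orsf_vi(fit)

# One value per factor level (no aggregation).
vi_levels <- orsf_vi(fit, group_factors = FALSE)

# Names that appear only in the level-specific result, e.g. the two
# non-reference levels of edema (levels 0, 0.5, and 1).
setdiff(names(vi_levels), names(vi_grouped))
```

Level-specific values can be helpful when one level of a factor drives most of its importance, while the grouped values are easier to use for variable selection.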
You can fit an ORSF without VI, then add VI later:
fit_no_vi <- orsf(pbc_orsf, Surv(time, status) ~ . - id, importance = 'none')
# Note: you can't call orsf_vi_anova() on fit_no_vi because anova
# VI can only be computed while the forest is being grown.
orsf_vi_negate(fit_no_vi)
##          bili        copper           age       protime       albumin
##  0.0873098562  0.0253698687  0.0242758908  0.0120337570  0.0086476349
##       ascites         edema           ast          chol           sex
##  0.0052094186  0.0036031812  0.0032298395  0.0029693686  0.0024484268
##       spiders        hepato         stage      alk.phos          trig
##  0.0023442384  0.0020316733  0.0015107314  0.0003646593  0.0001562826
##           trt      platelet
## -0.0016149198 -0.0022400500
orsf_vi_permute(fit_no_vi)
##          bili           age        copper       protime         stage
##  0.0155240675  0.0118774745  0.0062513024  0.0055740779  0.0043238175
##       ascites       albumin        hepato       spiders          chol
##  0.0041154407  0.0024484268  0.0017191081  0.0015628256  0.0015107314
##         edema           sex           ast      alk.phos      platelet
##  0.0014685599  0.0008856012  0.0005730360  0.0002083767 -0.0003646593
##           trt
## -0.0016149198
Fit an ORSF and compute VI at the same time:
fit_permute_vi <- orsf(pbc_orsf, Surv(time, status) ~ . - id, importance = 'permute')
# get the vi instantly (i.e., it doesn't need to be computed again)
orsf_vi_permute(fit_permute_vi)
##          bili           age         stage       albumin        copper
##  0.0168785164  0.0097416128  0.0058345489  0.0045842884  0.0037507814
##          chol           sex       ascites       protime         edema
##  0.0035424047  0.0030735570  0.0028651802  0.0022400500  0.0021941575
##           ast       spiders      platelet          trig        hepato
##  0.0014586372  0.0007814128  0.0002083767 -0.0002604709 -0.0010418837
##           trt      alk.phos
## -0.0016149198 -0.0024484268
You can still get negation VI from this fit, but it needs to be computed:
orsf_vi_negate(fit_permute_vi)
##         bili       copper          age      protime      albumin      ascites
##  0.094186289  0.025213586  0.025161492  0.011929569  0.008543447  0.005626172
##        stage         chol          ast          sex        edema      spiders
##  0.005365701  0.004375912  0.004011252  0.003907064  0.003243483  0.002396333
##     platelet     alk.phos       hepato          trt         trig
##  0.001406543  0.000833507 -0.001406543 -0.002240050 -0.003125651
Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the Yield of Medical Tests. JAMA 1982; 247(18):2543-2546. DOI: 10.1001/jama.1982.03320430047030
Breiman L. Random forests. Machine learning 2001 Oct; 45(1):5-32. DOI: 10.1023/A:1010933404324
Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA. On oblique random forests. Joint European Conference on Machine Learning and Knowledge Discovery in Databases 2011 Sep 4; pp. 453-469. DOI: 10.1007/978-3-642-23783-6_29
Jaeger BC, Welden S, Lenoir K, Speiser JL, Segar MW, Pandey A, Pajewski NM. Accelerated and interpretable oblique random survival forests. arXiv e-prints 2022 Aug; arXiv-2208. URL: https://arxiv.org/abs/2208.01129