orsf_vi {aorsf} R Documentation

ORSF variable importance

Description

Estimate the importance of individual variables using oblique random survival forests.

Usage

orsf_vi(object, group_factors = TRUE, importance = NULL, oobag_fun = NULL, ...)

orsf_vi_negate(object, group_factors = TRUE, oobag_fun = NULL, ...)

orsf_vi_permute(object, group_factors = TRUE, oobag_fun = NULL, ...)

orsf_vi_anova(object, group_factors = TRUE, ...)

Arguments

object

(orsf_fit) a trained oblique random survival forest (see orsf).

group_factors

(logical) if TRUE, the importance of factor variables will be reported overall by aggregating the importance of individual levels of the factor. If FALSE, the importance of individual factor levels will be returned.

importance

(character) Indicate method for variable importance:

  • 'anova': compute analysis of variance (ANOVA) importance

  • 'negate': compute negation importance

  • 'permute': compute permutation importance

oobag_fun

(function) to be used for evaluating out-of-bag prediction accuracy after negating coefficients (if importance = 'negate') or permuting the values of a predictor (if importance = 'permute').

  • When oobag_fun = NULL (the default), Harrell's C-statistic (1982) is used to evaluate accuracy.

  • If you use your own oobag_fun, note the following:

    • oobag_fun should have two inputs: y_mat and s_vec

    • y_mat is a two column matrix with first column named 'time', second named 'status'

    • s_vec is a numeric vector containing predicted survival probabilities.

    • oobag_fun should return a numeric output of length 1

    • the same oobag_fun should have been used when you created object so that the initial value of out-of-bag prediction accuracy is consistent with the values that will be computed while variable importance is estimated.

For more details, see the out-of-bag vignette.
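The requirements listed above can be sketched with a small base R function. This is an illustration only: the name oobag_c_naive and the toy data are hypothetical, and the function is a naive pairwise concordance, not the implementation aorsf uses internally. The commented-out lines at the end assume aorsf is installed and that the same function is supplied both when fitting and when computing importance.

```r
# Hypothetical custom out-of-bag function: a naive concordance statistic.
# y_mat: two-column matrix with columns 'time' and 'status'
# s_vec: numeric vector of predicted survival probabilities
oobag_c_naive <- function(y_mat, s_vec) {
  time   <- y_mat[, "time"]
  status <- y_mat[, "status"]
  concordant <- 0
  comparable <- 0
  # A pair is comparable when the earlier time is an observed event
  for (i in which(status == 1)) {
    later <- which(time > time[i])
    comparable <- comparable + length(later)
    # Concordant: the later observation has higher predicted survival;
    # ties in predicted survival count as one half
    concordant <- concordant +
      sum(s_vec[later] > s_vec[i]) +
      0.5 * sum(s_vec[later] == s_vec[i])
  }
  if (comparable == 0) return(NA_real_)
  concordant / comparable  # numeric of length 1
}

# Toy data (hypothetical) to check the function on its own
y_mat <- cbind(time = c(1, 2, 3, 4), status = c(1, 1, 0, 1))
s_vec <- c(0.2, 0.4, 0.6, 0.8)
oobag_c_naive(y_mat, s_vec)

# Assumed usage with aorsf (not run here):
# fit <- orsf(pbc_orsf, Surv(time, status) ~ . - id,
#             importance = 'none', oobag_fun = oobag_c_naive)
# orsf_vi_permute(fit, oobag_fun = oobag_c_naive)
```

Returning a value where larger means better keeps the interpretation in line with the default C-statistic.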

...

Further arguments passed to or from other methods (not currently used).

Details

When an orsf_fit object is fitted with importance = 'anova', 'negate', or 'permute', the output will have a vector of importance values based on the requested type of importance. However, you may still want to call orsf_vi() on this output if you want to group factor levels into one overall importance value.

orsf_vi() is a general purpose function to extract or compute variable importance estimates from an 'orsf_fit' object (see orsf). orsf_vi_negate(), orsf_vi_permute(), and orsf_vi_anova() are wrappers for orsf_vi(). The way these functions work depends on whether the object they are given already has variable importance estimates in it or not (see examples).

Value

orsf_vi functions return a named numeric vector.

The returned vector is sorted from highest to lowest value, with higher values indicating higher importance.
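Because the result is a plain named numeric vector, ordinary vector tools apply to it. A minimal sketch, using a hand-typed vector as a stand-in for real orsf_vi() output (the values are rounded from the negation output in the Examples section; the object names are hypothetical):

```r
# Stand-in for the output of orsf_vi(): named, sorted high to low
vi <- c(bili = 0.0873, copper = 0.0254, age = 0.0243,
        trt = -0.0016, platelet = -0.0022)

# Names of the three highest-ranked variables
top3 <- names(vi)[1:3]

# Variables with importance above zero
positive <- names(vi)[vi > 0]

top3
positive
```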

Variable importance methods

negation importance: Each variable is assessed separately by multiplying the variable's coefficients by -1 and then determining how much the model's performance changes. The worse the model's performance after negating coefficients for a given variable, the more important the variable. This technique is promising because it does not require permutation and it emphasizes variables with larger coefficients in linear combinations, but it is also relatively new and has not been studied as much as permutation importance. See Jaeger et al. (2022) for more details on this technique.
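The idea can be illustrated on a single linear combination in base R. This is a toy sketch under stated assumptions, not the aorsf algorithm: the data are simulated without censoring, one fixed coefficient vector (beta) stands in for the many linear combinations in a forest, and c_stat is a hypothetical helper.

```r
set.seed(1)
n <- 100
x <- cbind(signal = rnorm(n), noise = rnorm(n))

# Simulated, uncensored times: larger 'signal' means an earlier event
time <- exp(-x[, "signal"])

# Toy "model": one linear combination with a large and a small coefficient
beta <- c(signal = 1, noise = 0.1)

# Concordance for uncensored data: share of pairs in which the
# observation with higher risk has the shorter time
c_stat <- function(time, risk) {
  pairs <- utils::combn(length(time), 2)
  i <- pairs[1, ]
  j <- pairs[2, ]
  mean(sign(time[i] - time[j]) == -sign(risk[i] - risk[j]))
}

baseline <- c_stat(time, drop(x %*% beta))

# Negate one coefficient at a time and record the drop in concordance
vi_negate <- sapply(names(beta), function(v) {
  beta_neg <- beta
  beta_neg[v] <- -beta_neg[v]
  baseline - c_stat(time, drop(x %*% beta_neg))
})

sort(vi_negate, decreasing = TRUE)
```

Flipping the large coefficient reverses most of the ranking, while flipping the small one barely changes it, which is why the approach favors variables with larger coefficients.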

permutation importance: Each variable is assessed separately by randomly permuting the variable's values and then determining how much the model's performance changes. The worse the model's performance after permuting the values of a given variable, the more important the variable. This technique is flexible, intuitive, and frequently used. It also has several known limitations.
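A generic version of this procedure can be sketched in base R. The sketch makes several simplifying assumptions: the data are simulated without censoring, a fixed risk score (risk_score, hypothetical) stands in for a fitted forest, and permutation is applied once to the whole data set rather than to out-of-bag data.

```r
set.seed(1)
n <- 100
x <- cbind(signal = rnorm(n), noise = rnorm(n))

# Simulated, uncensored times: larger 'signal' means an earlier event
time <- exp(-x[, "signal"])

# Toy "model": a fixed risk score standing in for forest predictions
risk_score <- function(x) x[, "signal"] + 0.1 * x[, "noise"]

# Concordance for uncensored data: share of pairs in which the
# observation with higher risk has the shorter time
c_stat <- function(time, risk) {
  pairs <- utils::combn(length(time), 2)
  i <- pairs[1, ]
  j <- pairs[2, ]
  mean(sign(time[i] - time[j]) == -sign(risk[i] - risk[j]))
}

baseline <- c_stat(time, risk_score(x))

# Permute one column at a time and record the drop in concordance
vi_permute <- sapply(colnames(x), function(v) {
  x_perm <- x
  x_perm[, v] <- sample(x_perm[, v])
  baseline - c_stat(time, risk_score(x_perm))
})

sort(vi_permute, decreasing = TRUE)
```

Values near zero, including small negative ones, indicate that scrambling the variable made little difference to performance.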

analysis of variance (ANOVA) importance: A p-value is computed for each coefficient in each linear combination of variables in each decision tree. Importance for an individual predictor variable is the proportion of times a p-value for its coefficient is < 0.01. This technique is very efficient computationally, but may not be as effective as permutation or negation in terms of selecting signal over noise variables. See Menze et al. (2011) for more details on this technique.
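The counting rule can be sketched in base R. This is only an analogy under loud assumptions: ordinary linear regression (lm) on a simulated continuous outcome stands in for the survival models fitted at each node, and random subsets of rows and predictors stand in for tree nodes. All object names are hypothetical.

```r
set.seed(1)
n <- 200
dat <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
dat$y <- 2 * dat$signal + rnorm(n)

vars <- c("signal", "noise1", "noise2")
n_used   <- setNames(numeric(length(vars)), vars)  # times each variable was used
n_signif <- setNames(numeric(length(vars)), vars)  # times its p-value was < 0.01

for (node in 1:200) {
  rows   <- sample(n, 50)   # stand-in for the observations in a node
  picked <- sample(vars, 2) # stand-in for predictors tried at a node
  fit <- lm(reformulate(picked, response = "y"), data = dat[rows, ])
  p <- summary(fit)$coefficients[picked, "Pr(>|t|)"]
  n_used[picked]   <- n_used[picked] + 1
  n_signif[picked] <- n_signif[picked] + (p < 0.01)
}

# Importance: proportion of uses in which the coefficient had p < 0.01
vi_anova <- sort(n_signif / n_used, decreasing = TRUE)
vi_anova
```

No extra predictions are needed after the models are fitted, which is consistent with the technique being computationally cheap.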

Examples

ANOVA importance

The default variable importance technique, ANOVA, is calculated while you fit an ORSF ensemble.

fit <- orsf(pbc_orsf, Surv(time, status) ~ . - id)

fit
## ---------- Oblique random survival forest
## 
##      Linear combinations: Accelerated
##           N observations: 276
##                 N events: 111
##                  N trees: 500
##       N predictors total: 17
##    N predictors per node: 5
##  Average leaves per tree: 24
## Min observations in leaf: 5
##       Min events in leaf: 1
##           OOB stat value: 0.84
##            OOB stat type: Harrell's C-statistic
##      Variable importance: anova
## 
## -----------------------------------------

ANOVA is the default because it is fast, but it may not be as decisive as the permutation and negation techniques for variable selection.

Raw VI values

The ‘raw’ variable importance values are stored in the fit object:

fit$importance
##     edema_1   ascites_1        bili      copper         age     albumin 
##  0.40000000  0.36013072  0.28140066  0.19806763  0.18608329  0.17480916 
##   edema_0.5     protime        chol       stage   spiders_1         ast 
##  0.15478339  0.15170120  0.14917270  0.13967665  0.13223663  0.11937944 
##    hepato_1       sex_f        trig    alk.phos    platelet trt_placebo 
##  0.11901034  0.10338744  0.09902514  0.09348659  0.07993689  0.06591549

These are ‘raw’ because values for factors have not been aggregated into a single value. Currently there is one value for each of k-1 levels of a k-level factor. For example, you can see edema_1 and edema_0.5 in the importance values above because edema is a factor variable with levels of 0, 0.5, and 1.

Collapse VI across factor levels

To get aggregated values across all levels of each factor, use orsf_vi() with group_factors set to TRUE (the default).

orsf_vi(fit)
##    ascites       bili      edema     copper        age    albumin    protime 
## 0.36013072 0.28140066 0.25403773 0.19806763 0.18608329 0.17480916 0.15170120 
##       chol      stage    spiders        ast     hepato        sex       trig 
## 0.14917270 0.13967665 0.13223663 0.11937944 0.11901034 0.10338744 0.09902514 
##   alk.phos   platelet        trt 
## 0.09348659 0.07993689 0.06591549

Add VI to an ORSF

You can fit an ORSF without VI, then add VI later.

fit_no_vi <- orsf(pbc_orsf,
                  Surv(time, status) ~ . - id,
                  importance = 'none')

# Note: you can't call orsf_vi_anova() on fit_no_vi because anova
# VI can only be computed while the forest is being grown.

orsf_vi_negate(fit_no_vi)
##          bili        copper           age       protime       albumin 
##  0.0873098562  0.0253698687  0.0242758908  0.0120337570  0.0086476349 
##       ascites         edema           ast          chol           sex 
##  0.0052094186  0.0036031812  0.0032298395  0.0029693686  0.0024484268 
##       spiders        hepato         stage      alk.phos          trig 
##  0.0023442384  0.0020316733  0.0015107314  0.0003646593 -0.0001562826 
##           trt      platelet 
## -0.0016149198 -0.0022400500

orsf_vi_permute(fit_no_vi)
##          bili           age        copper       protime         stage 
##  0.0155240675  0.0118774745  0.0062513024  0.0055740779  0.0043238175 
##       ascites       albumin        hepato       spiders          chol 
##  0.0041154407  0.0024484268  0.0017191081  0.0015628256  0.0015107314 
##         edema           sex           ast      alk.phos      platelet 
##  0.0014685599  0.0008856012  0.0005730360 -0.0002083767 -0.0003646593 
##           trt 
## -0.0016149198

ORSF and VI all at once

Fit an ORSF and compute VI at the same time.

fit_permute_vi <- orsf(pbc_orsf,
                        Surv(time, status) ~ . - id,
                        importance = 'permute')

# Get the VI immediately (it is already stored, so nothing is recomputed)
orsf_vi_permute(fit_permute_vi)
##          bili           age         stage       albumin        copper 
##  0.0168785164  0.0097416128  0.0058345489  0.0045842884  0.0037507814 
##          chol           sex       ascites       protime         edema 
##  0.0035424047  0.0030735570  0.0028651802  0.0022400500  0.0021941575 
##           ast       spiders      platelet          trig        hepato 
##  0.0014586372  0.0007814128  0.0002083767 -0.0002604709 -0.0010418837 
##           trt      alk.phos 
## -0.0016149198 -0.0024484268

You can still get negation VI from this fit, but it needs to be computed.

orsf_vi_negate(fit_permute_vi)
##         bili       copper          age      protime      albumin      ascites 
##  0.094186289  0.025213586  0.025161492  0.011929569  0.008543447  0.005626172 
##        stage         chol          ast          sex        edema      spiders 
##  0.005365701  0.004375912  0.004011252  0.003907064  0.003243483  0.002396333 
##     platelet     alk.phos       hepato          trt         trig 
##  0.001406543 -0.000833507 -0.001406543 -0.002240050 -0.003125651

References

Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the Yield of Medical Tests. JAMA 1982; 247(18):2543-2546. DOI: 10.1001/jama.1982.03320430047030

Breiman L. Random forests. Machine learning 2001 Oct; 45(1):5-32. DOI: 10.1023/A:1010933404324

Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA. On oblique random forests. Joint European Conference on Machine Learning and Knowledge Discovery in Databases 2011 Sep 4; pp. 453-469. DOI: 10.1007/978-3-642-23783-6_29

Jaeger BC, Welden S, Lenoir K, Speiser JL, Segar MW, Pandey A, Pajewski NM. Accelerated and interpretable oblique random survival forests. arXiv e-prints 2022 Aug; arXiv-2208. URL: https://arxiv.org/abs/2208.01129


[Package aorsf version 0.0.4 Index]