model.importance.plot {ModelMap} | R Documentation
Compares the variable importance of two models with a back to back barchart.
Description
Takes two models and produces a back to back bar chart to compare the importance of the predictor variables. Models can be any combination of Random Forest or Stochastic Gradient Boosting, as long as both models have the same predictor variables.
Usage
model.importance.plot(model.obj.1 = NULL, model.obj.2 = NULL,
model.name.1 = "Model 1", model.name.2 = "Model 2", imp.type.1 = NULL,
imp.type.2 = NULL, type.label=TRUE, class.1 = NULL, class.2 = NULL,
quantile.1=NULL, quantile.2=NULL,
col.1="grey", col.2="black", scale.by = "sum", sort.by = "model.obj.1",
cf.mincriterion.1 = 0, cf.conditional.1 = FALSE, cf.threshold.1 = 0.2,
cf.nperm.1 = 1, cf.mincriterion.2 = 0, cf.conditional.2 = FALSE,
cf.threshold.2 = 0.2, cf.nperm.2 = 1, predList = NULL, folder = NULL,
PLOTfn = NULL, device.type = NULL, res=NULL, jpeg.res = 72,
device.width = 7, device.height = 7, units="in", pointsize=12,
cex=par()$cex,...)
Arguments
model.obj.1
model.obj.2
model.name.1
String. Label for left side of barchart.
model.name.2
String. Label for right side of barchart.
imp.type.1
Number. Type of importance to use for model 1. Importance type 1 is permutation based, as described in Breiman (2001). Importance type 2 is model based. For RF models, it is the decrease in node impurities attributable to each predictor variable. For SGB models, it is the reduction attributable to each variable in predicting the gradient on each iteration. Default for random forest models is
imp.type.2
Number. Type of importance to use for model 2. Importance type 1 is permutation based, as described in Breiman (2001). Importance type 2 is model based. For RF models, it is the decrease in node impurities attributable to each predictor variable. For SGB models, it is the reduction attributable to each variable in predicting the gradient on each iteration. Default for random forest models is
type.label
Logical. Should axis labels include the importance type for each side of the plot?
class.1
String. For binary and categorical random forest models. If the name of a class is specified, the class-specific relative influence is used for the plot. If
class.2
String. For binary and categorical random forest models. If the name of a class is specified, the class-specific relative influence is used for the plot. If
quantile.1
Numeric. QRF models. Quantile to use for model 1. Must be one of the quantiles used in building the QRF model.
quantile.2
Numeric. QRF models. Quantile to use for model 2. Must be one of the quantiles used in building the QRF model.
col.1
String. For binary and categorical random forest models. Color to use for bars for model 1. Defaults to grey.
col.2
String. For binary and categorical random forest models. Color to use for bars for model 2. Defaults to black.
scale.by
String. Scale by:
sort.by
String. Sort by:
cf.mincriterion.1
Number. CF models. The value of the test statistic or 1 - p-value that must be exceeded in order to include a split in the computation of the importance. The default
cf.conditional.1
Logical. CF models. A logical determining whether unconditional or conditional computation of the importance is performed for
cf.threshold.1
Number. CF models. The value of the test statistic or 1 - p-value of the association between the variable of interest and a covariate that must be exceeded in order to include the covariate in the conditioning scheme for the variable of interest (only relevant if
cf.nperm.1
Number. CF models. The number of permutations performed.
cf.mincriterion.2
Number. CF models. The value of the test statistic or 1 - p-value that must be exceeded in order to include a split in the computation of the importance. The default
cf.conditional.2
Logical. CF models. A logical determining whether unconditional or conditional computation of the importance is performed for
cf.threshold.2
Number. CF models. The value of the test statistic or 1 - p-value of the association between the variable of interest and a covariate that must be exceeded in order to include the covariate in the conditioning scheme for the variable of interest (only relevant if
cf.nperm.2
Number. CF models. The number of permutations performed.
predList
String. A character vector of the predictor short names used to build the models. If
folder
String. The folder used for all output. Do not add an ending slash to the path string. If
PLOTfn
String. The file name to use to save the generated graphical plots. If
device.type
String or vector of strings. Model validation. One or more device types for graphical output from model validation diagnostics. Current choices:
res
Integer. Model validation. Pixels per inch for jpeg, png, and tiff plots. The default is 72 dpi, good for on-screen viewing. For printing, the suggested setting is 300 dpi.
jpeg.res
Integer. Model validation. Deprecated. Ignored unless
device.width
Integer. Model validation. The device width for diagnostic plots in inches.
device.height
Integer. Model validation. The device height for diagnostic plots in inches.
units
Model validation. The units in which
pointsize
Integer. Model validation. The default pointsize of plotted text, interpreted as big points (1/72 inch) at
cex
Integer. Model validation. The cex for diagnostic plots.
...
Arguments to be passed to methods, such as graphical parameters (see
Details
The importance measures used in this plot depend on the model type (RF versus SGB) and the response type (continuous, categorical, or binary).
Importance type 1 is permutation based, as described in Breiman (2001). Importance is calculated by randomly permuting each predictor variable and computing the associated reduction in predictive performance, using Out Of Bag error for RF models and training error for SGB models. Note that for SGB models, permutation based importance measures are still considered experimental. Importance type 2 is model based. For RF models, importance type 2 is calculated from the decrease in node impurities attributable to each predictor variable. For SGB models, importance type 2 is the reduction attributable to each variable in predicting the gradient on each iteration, as described in Friedman (2001).
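The permutation idea behind importance type 1 can be sketched in a few lines of base R. This is an illustration only, not the code used by ModelMap or the underlying model packages: it assumes a plain linear model fitted to the built-in mtcars data, measures error on the training data (closer to what is described above for SGB than to the Out Of Bag approach for RF), and the helper name mse and the choice of 50 repeats are hypothetical.

```r
# Minimal sketch of permutation importance (illustrative; not ModelMap code).
set.seed(38)
dat <- mtcars[, c("mpg", "wt", "hp", "qsec")]
fit <- lm(mpg ~ wt + hp + qsec, data = dat)

# Hypothetical helper: mean squared error of the model on a data set.
mse <- function(model, data) {
  mean((data$mpg - predict(model, newdata = data))^2)
}
base_mse <- mse(fit, dat)

# For each predictor, shuffle its values, re-score the unchanged model,
# and record the average increase in error over several shuffles.
perm_importance <- sapply(c("wt", "hp", "qsec"), function(v) {
  increases <- replicate(50, {
    shuffled <- dat
    shuffled[[v]] <- sample(shuffled[[v]])  # break the link between v and mpg
    mse(fit, shuffled) - base_mse
  })
  mean(increases)
})

print(sort(perm_importance, decreasing = TRUE))
```

A predictor whose shuffling raises the error more is one the model relies on more heavily.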
For RF models:
response type | type | method        | Importance Measure
"continuous"  | 1    | permutation   | %IncMSE
"binary"      | 1    | permutation   | Mean Decrease Accuracy
"categorical" | 1    | permutation   | Mean Decrease Accuracy
"continuous"  | 2    | node impurity | Residual sum of squares
"binary"      | 2    | node impurity | Mean Decrease Gini
"categorical" | 2    | node impurity | Mean Decrease Gini
For Random Forest models, if imp.type is not specified, importance type defaults to imp.type of 1 (permutation importance). For SGB models, permutation importance is considered experimental, so importance defaults to imp.type of 2 (reduction of gradient of the loss function).
Also, for binary and categorical Random Forest models, class specific importance plots can be generated by the use of the class argument. Note that class specific importance is only available for Random Forest models with importance type 1.
For CF models:
response type | type | method        | Importance Measure
"continuous"  | 1    | permutation   | Mean Decrease Accuracy
"binary"      | 1    | permutation   | Mean Decrease Accuracy
"categorical" | 1    | permutation   | Mean Decrease Accuracy
"continuous"  | 2    | node impurity | Not Available
"binary"      | 2    | node impurity | Mean Decrease in AUC
"categorical" | 2    | node impurity | Not Available
For binary CF models, if imp.type = 2, the function uses AUC-based variable importances as described by Janitza et al. (2013). Here, the area under the curve, instead of the accuracy, is used to calculate the importance of each variable. This AUC-based variable importance measure is more robust towards class imbalance.
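Why AUC is less sensitive to class imbalance than accuracy can be seen with a small base R sketch. This is not the implementation used for CF models; the auc helper (a rank-sum formulation), the made-up labels, and the constant score are all assumptions chosen for illustration.

```r
# Illustrative sketch only; not the CF importance code.
# AUC via the rank-sum (Mann-Whitney) formulation, base R only.
auc <- function(score, label) {
  pos <- label == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  (sum(rank(score)[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up, heavily imbalanced data: 95 negatives and 5 positives.
label <- c(rep(0, 95), rep(1, 5))

# An uninformative "model" that gives every case the same low score.
score_const <- rep(0.1, 100)

# Accuracy at a 0.5 cutoff versus AUC for the same scores.
accuracy  <- mean((score_const > 0.5) == (label == 1))
auc_const <- auc(score_const, label)
print(c(accuracy = accuracy, auc = auc_const))
```

With these scores the accuracy looks high simply because most cases are negative, while the AUC stays at chance level. A permutation importance built on AUC is therefore not flattered by a dominant class in the same way.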
Also, for CF models, if cf.conditional = TRUE, the importance of each variable is computed by permuting within a grid defined by the covariates that are associated with the variable of interest (with 1 - p-value greater than the threshold). The resulting variable importance score is conditional in the sense of beta coefficients in regression models, but represents the effect of a variable in both main effects and interactions. See Strobl et al. (2008) for details. Conditional importance can be slow for large datasets.
Value
The function returns a two-element list: IMP1 is the variable importance for model.obj.1, and IMP2 is the variable importance for model.obj.2. This is mostly intended for CF models, where calculating the conditional importance can represent a considerable time investment. For other model types, it would be just as easy to recalculate importances on the fly as needed.
Note
Importance currently unavailable for QRF models.
Author(s)
Elizabeth Freeman
References
Breiman, L. (2001). Random Forests. Machine Learning, 45, 5-32.
Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm, and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing, http://dx.doi.org/10.1007/s11222-012-9349-1
Torsten Hothorn, Kurt Hornik, and Achim Zeileis (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15 (3), 651-674. Preprint available from http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf
Silke Janitza, Carolin Strobl, and Anne-Laure Boulesteix (2013). An AUC-based Permutation Variable Importance Measure for Random Forests. BMC Bioinformatics, 14, 119. http://www.biomedcentral.com/1471-2105/14/119
Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. http://www.biomedcentral.com/1471-2105/9/307
See Also
Examples
## Not run:
###########################################################################
############################# Run this set up code: #######################
###########################################################################
# set seed:
seed=38
# Define training and test files:
qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")
# Define folder for all output:
folder=getwd()
#identifier for individual training and test data points
unique.rowname="ID"
##################################################################
########## Continuous Response, Continuous Predictors ############
##################################################################
#file names:
MODELfn.RF="RF_Bio_TC"
#predictors:
predList=c("TCB","TCG","TCW")
#define which predictors are categorical:
predFactor=FALSE
# Response name and type:
response.name="BIO"
response.type="continuous"
########## Build Models #################################
model.obj.RF = model.build( model.type="RF",
qdata.trainfn=qdata.trainfn,
folder=folder,
unique.rowname=unique.rowname,
MODELfn=MODELfn.RF,
predList=predList,
predFactor=predFactor,
response.name=response.name,
response.type=response.type,
seed=seed
)
########## Make Importance Plot - RF Importance type 1 vs 2 #######
model.importance.plot( model.obj.1=model.obj.RF,
model.obj.2=model.obj.RF,
model.name.1="PercentIncMSE",
model.name.2="IncNodePurity",
imp.type.1=1,
imp.type.2=2,
scale.by="sum",
sort.by="predList",
predList=predList,
main="Imp type 1 vs Imp type 2",
device.type="default")
##################################################################
########## Categorical Response, Continuous Predictors ###########
##################################################################
#file name:
MODELfn="RF_NLCD_TC"
#predictors:
predList=c("TCB","TCG","TCW")
#define which predictors are categorical:
predFactor=FALSE
# Response name and type:
response.name="NLCD"
response.type="categorical"
########## Build Model #################################
model.obj.NLCD = model.build( model.type="RF",
qdata.trainfn=qdata.trainfn,
folder=folder,
unique.rowname=unique.rowname,
MODELfn=MODELfn,
predList=predList,
predFactor=predFactor,
response.name=response.name,
response.type=response.type,
seed=seed)
############## Make Importance Plot ###################
model.importance.plot( model.obj.1=model.obj.NLCD,
model.obj.2=model.obj.NLCD,
model.name.1="NLCD=41",
model.name.2="NLCD=42",
class.1="41",
class.2="42",
scale.by="sum",
sort.by="predList",
predList=predList,
main="Class 41 vs. Class 42",
device.type="default")
##################################################################
############## Conditional inference forest models ###############
##################################################################
#predictors:
predList=c("TCB","TCG","TCW","NLCD")
#define which predictors are categorical:
predFactor=c("NLCD")
#binary response
response.name="CONIFTYP"
response.type="binary"
MODELfn.CF="CF_CONIFTYP_TCandNLCD"
####################### Build Model ##############################
model.obj.CF = model.build( model.type="CF",
qdata.trainfn=qdata.trainfn,
folder=folder,
unique.rowname=unique.rowname,
MODELfn=MODELfn.CF,
predList=predList,
predFactor=predFactor,
response.name=response.name,
response.type=response.type,
seed=seed
)
################## Make Importance Plot #########################
#Conditional vs. Unconditional importance#
model.importance.plot( model.obj.1=model.obj.CF,
model.obj.2=model.obj.CF,
model.name.1="conditional",
model.name.2="unconditional",
imp.type.1=1,
imp.type.2=1,
cf.conditional.1=TRUE,
cf.conditional.2=FALSE,
scale.by="sum",
sort.by="predList",
predList=predList,
main="Conditional versus Unconditional",
device.type="default"
)
## End(Not run) # end dontrun
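For readers without the example data at hand, the general shape of a back to back barchart with importances scaled by their sum can be sketched with base graphics. This is not the plotting code of model.importance.plot: the importance values are made up, and treating scale.by = "sum" as dividing each model's importances by their total is an assumption about what that option does.

```r
# Illustrative sketch only; values are hypothetical, not ModelMap output.
imp1 <- c(TCB = 30, TCG = 50, TCW = 20)     # e.g. a permutation-style measure
imp2 <- c(TCB = 400, TCG = 900, TCW = 700)  # e.g. a node-impurity-style measure

# Assumed meaning of scaling by sum: each model's importances add to 1,
# so measures on different scales can share one axis.
s1 <- imp1 / sum(imp1)
s2 <- imp2 / sum(imp2)

# Sort predictors by the first model and keep the second in the same order.
s1 <- s1[order(s1)]
s2 <- s2[names(s1)]

# Model 1 is drawn as negative bars (left), model 2 as positive bars (right).
xmax <- max(s1, s2)
mids <- barplot(-s1, horiz = TRUE, xlim = c(-xmax, xmax), col = "grey",
                axisnames = FALSE, axes = FALSE, main = "Model 1 vs Model 2")
barplot(s2, horiz = TRUE, add = TRUE, col = "black",
        axisnames = FALSE, axes = FALSE)

# Label the x axis with absolute values and the bars with predictor names.
ticks <- pretty(c(-xmax, xmax))
axis(1, at = ticks, labels = abs(ticks))
axis(2, at = mids, labels = names(s1), las = 1)
```

Drawing one model with negated values is a simple way to get the two sets of bars to share a baseline at zero.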