get_pdp_predictions {easyalluvial} | R Documentation |
get predictions compatible with the partial dependence plotting method
Description
Alluvial plots can display higher-dimensional data on a plane and therefore lend themselves to plotting the response of a statistical model to changes in the input data across multiple dimensions. The practical limit is 4 dimensions, whereas conventional partial dependence plots are limited to 2.
Briefly, the 4 variables with the highest feature importance for a given model are selected, and for each of them 5 values spread over the variable's range are chosen. A grid of all possible combinations of these values is then created. All non-plotted variables are set to the values found in the first row of the training data set, and model predictions are generated for this artificial data space. This process is repeated for each row in the training data set, and the model response is averaged over all rows. Each combination is plotted as a flow, coloured by the bin that contains the average model response for that combination.
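The procedure described above can be sketched in base R. This is an illustration only, not the code used by easyalluvial: the linear model, the choice of wt and hp as the "most important" variables, and all object names (plot_vars, grid, avg_response, response_bin) are assumptions made for the example.

```r
# Illustrative sketch only; not the easyalluvial implementation.
df <- mtcars
m <- lm(mpg ~ wt + hp + disp + qsec, data = df)

# Assumption: pretend wt and hp are the most important variables.
plot_vars <- c("wt", "hp")
bins <- 5

# 5 evenly spaced values across the range of each selected variable
grid_values <- lapply(df[plot_vars], function(x) {
  seq(min(x), max(x), length.out = bins)
})

# grid of all possible combinations (5 x 5 = 25 rows)
grid <- expand.grid(grid_values)

# For each training row: set the non-plotted variables to that row's values,
# predict over the whole grid, and collect the predictions column by column.
pred_per_row <- sapply(seq_len(nrow(df)), function(i) {
  newdata <- grid
  for (v in setdiff(names(df), plot_vars)) {
    newdata[[v]] <- df[[v]][i]
  }
  predict(m, newdata = newdata)
})

# average model response for each grid combination
avg_response <- rowMeans(pred_per_row)

# bin the averaged response; each flow would be coloured by its bin
response_bin <- cut(avg_response, breaks = bins)
```

Each row of grid corresponds to one flow in the alluvial plot, and response_bin gives the colour group of that flow.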
Usage
get_pdp_predictions(
df,
imp,
m,
degree = 4,
bins = 5,
.f_predict = predict,
parallel = FALSE
)
Arguments
df |
dataframe, training data |
imp |
dataframe, with not more than two columns: one numeric column containing importance measures and one character or factor column containing the corresponding variable names as found in the training data. |
m |
model object |
degree |
integer, number of top important variables to select. Plotting more than 4 will result in too many flows and the alluvial plot will not be very readable. Default: 4 |
bins |
integer, number of bins for numeric variables. Increasing this number might result in too many flows. Default: 5 |
.f_predict |
corresponding model predict() function. Needs to accept 'm' as the first parameter and use the 'newdata' parameter. Supply a wrapper for predict functions with x-y syntax. For parallel processing, the predict method of an object class will not always get imported correctly into the worker environment. The correct predict method can be passed via this parameter, for example randomForest:::predict.randomForest. Note that many modeling packages do not export the predict method explicitly, so it can only be accessed using :::. |
parallel |
logical, turn on parallel processing. Default: FALSE |
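As a hedged illustration of the wrapper mentioned for .f_predict, the sketch below fits a model through an x-y interface (stats::lm.fit) that has no 'newdata'-style predict method, and wraps the prediction step in a function that takes 'm' first and accepts 'newdata'. The wrapper name f_predict_xy and the "intercept" column name are hypothetical, and how get_pdp_predictions calls the function internally is assumed here, not taken from the package source.

```r
# Sketch of a predict wrapper for a model fitted with x-y syntax.
# Assumption: .f_predict is called as .f_predict(m, newdata = <data.frame>).

# Fit with an x-y interface: design matrix and response instead of a formula
x <- cbind(intercept = 1, as.matrix(mtcars[, c("wt", "hp")]))
y <- mtcars$mpg
m <- lm.fit(x, y)

# Hypothetical wrapper: 'm' first, data passed through 'newdata'
f_predict_xy <- function(m, newdata, ...) {
  # recover the predictor names from the fitted coefficients
  vars <- setdiff(names(m$coefficients), "intercept")
  x_new <- cbind(intercept = 1, as.matrix(newdata[, vars, drop = FALSE]))
  # return a plain numeric vector of predictions
  drop(x_new %*% m$coefficients)
}

pred <- f_predict_xy(m, newdata = mtcars)
```

Under the stated assumption, such a wrapper could then be supplied as .f_predict = f_predict_xy.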
Details
For more on partial dependence plots see [https://christophm.github.io/interpretable-ml-book/pdp.html].
Value
vector, predictions
Parallel Processing
We use the 'furrr' and 'future' packages to parallelize some of the computational steps for calculating the predictions. It is up to the user to register a compatible backend (see plan).
Examples
df = mtcars2[, ! names(mtcars2) %in% 'ids' ]
m = randomForest::randomForest( disp ~ ., df)
imp = m$importance
pred = get_pdp_predictions(df, imp
, m
, degree = 3
, bins = 5)
# parallel processing --------------------------
## Not run:
future::plan("multisession")
# note that we have to pass the predict method via .f_predict otherwise
# it will not be available in the worker's environment.
pred = get_pdp_predictions(df, imp
, m
, degree = 3
, bins = 5
, parallel = TRUE
, .f_predict = randomForest:::predict.randomForest)
## End(Not run)