the_feature_engineer {spatialRF} | R Documentation |
Suggest variable interactions and composite features for random forest models
Description
Suggests candidate variable interactions and composite features that can improve predictive accuracy on data not used to train the model, as assessed via spatial cross-validation with rf_evaluate(). For a pair of predictors a and b, interactions are built via multiplication (a * b), while composite features are built by extracting the first factor of a principal component analysis performed with pca(), after rescaling a and b between 1 and 100. Interactions and composite features are named a..x..b and a..pca..b respectively.
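The construction described above can be sketched in base R. This is a minimal illustration, not the package's internal code: the vectors a and b and the helper rescale_1_100() are hypothetical, stats::prcomp() stands in for the package's pca(), and applying the rescaling before the multiplication as well as before the PCA is an assumption.

```r
# Hypothetical predictors; any two numeric vectors of equal length work.
set.seed(1)
a <- runif(50, min = 0, max = 10)
b <- 2 * a + rnorm(50)

# Hypothetical helper: linear rescaling to the range [1, 100].
rescale_1_100 <- function(x) {
  1 + 99 * (x - min(x)) / (max(x) - min(x))
}

a.scaled <- rescale_1_100(a)
b.scaled <- rescale_1_100(b)

# Interaction: element-wise product of the two predictors.
# Assumption: the product is computed on the rescaled values.
a..x..b <- a.scaled * b.scaled

# Composite feature: scores on the first principal component.
# stats::prcomp() is used here in place of spatialRF's pca().
a..pca..b <- prcomp(cbind(a = a.scaled, b = b.scaled))$x[, 1]

# Columns follow the naming convention a..x..b and a..pca..b.
features <- data.frame(a..x..b, a..pca..b)
head(features)
```

Both new columns have one value per row of the original data, so they can be appended to the training data frame alongside a and b.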
Candidate variables a and b are selected from those predictors in predictor.variable.names with a variable importance above importance.threshold (a quantile of the importance scores, set to 0.75 by default).
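As a rough illustration of this candidate selection, the sketch below filters predictors by an importance quantile and enumerates the resulting pairs. The importance scores and predictor names are made up, and the threshold of 0.5 is chosen only so the toy example yields more than one pair; the package's own implementation may differ.

```r
# Hypothetical permutation importance scores, one per predictor.
importance <- c(
  temp = 0.9, rain = 0.8, ndvi = 0.6, slope = 0.5,
  soil = 0.3, aspect = 0.2, dist = 0.1
)

# Illustrative quantile; larger values leave fewer candidates.
importance.threshold <- 0.5

# Importance value at the requested quantile.
cutoff <- quantile(importance, probs = importance.threshold)

# Predictors with importance strictly above the cutoff are candidates.
candidates <- names(importance)[importance > cutoff]

# All unordered pairs (a, b) of candidates, one pair per row.
pairs <- t(combn(candidates, 2))
colnames(pairs) <- c("a", "b")
pairs
```

The number of pairs grows quadratically with the number of candidates, which is why a higher importance.threshold shortens the screening.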
For each interaction and composite feature, a model including all the predictors plus the interaction or composite feature is fitted, and its R squared (or AUC if the response is binary), computed via spatial cross-validation (see rf_evaluate()), is compared with the R squared of the model without interactions or composite features.
From all the potential interactions screened, only those that meet three criteria are selected: a positive increase in the R squared (or AUC when the response is binary) of the model, a variable importance above the median, and a maximum correlation among themselves and with the predictors in predictor.variable.names not higher than cor.threshold (set to 0.75 by default). Such a restrictive set of rules ensures that the selected interactions can be used right away for modeling purposes without increasing model complexity unnecessarily. However, the suggested variable interactions might not make sense from a domain expertise standpoint, so please examine them with care.
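The selection rules can be pictured as a filter over a screening table. The sketch below is an assumption-laden illustration: the data frame, its column names, and its values are hypothetical, the median is taken over the screened features only, and the check for correlation among the interactions themselves is omitted for brevity.

```r
# Hypothetical screening table: one row per candidate feature.
screening <- data.frame(
  interaction.name = c("a..x..b", "a..pca..b", "a..x..c", "b..x..c"),
  interaction.importance = c(0.9, 0.8, 0.3, 0.7),
  interaction.metric.gain = c(0.04, 0.02, 0.05, -0.01),
  max.cor.with.predictors = c(0.40, 0.80, 0.30, 0.20),
  stringsAsFactors = FALSE
)

# Value shown for cor.threshold under Usage.
cor.threshold <- 0.75

# Keep rows that satisfy all three rules at once:
# positive gain, importance above the median, correlation within the limit.
keep <- screening$interaction.metric.gain > 0 &
  screening$interaction.importance > median(screening$interaction.importance) &
  screening$max.cor.with.predictors <= cor.threshold

selected <- screening[keep, ]
selected
```

In this toy table only a..x..b survives: a..pca..b is too correlated with the predictors, a..x..c has low importance, and b..x..c does not improve the metric.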
The function returns the criteria used to select the interactions, and the data required to use these interactions in a model.
Usage
the_feature_engineer(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  repetitions = 30,
  training.fraction = 0.75,
  importance.threshold = 0.75,
  cor.threshold = 0.75,
  point.color = viridis::viridis(100, option = "F", alpha = 0.8),
  seed = NULL,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
Arguments
data |
Data frame with a response variable and a set of predictors. Default: NULL |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of data |
predictor.variable.names |
Character vector with the names of the predictive variables, or object of class |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". If not provided, the comparison between models with and without variable interactions is not done. |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', which is set to 'permutation' rather than 'none'. Please consult the help file of ranger if you are not familiar with the arguments of this function. |
repetitions |
Integer, number of spatial folds to use during cross-validation. Must be lower than the total number of rows available in the model's data. Default: 30 |
training.fraction |
Number between 0.5 and 0.9 indicating the proportion of records to be used as training set during spatial cross-validation. Default: 0.75 |
importance.threshold |
Numeric between 0 and 1, quantile of variable importance scores above which individual predictors are selected to explore interactions among them. Larger values reduce the number of potential interactions explored. Default: 0.75 |
cor.threshold |
Numeric, maximum Pearson correlation between any pair of the selected interactions, and between any interaction and the predictors in predictor.variable.names. Default: 0.75 |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
seed |
Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same. Default: NULL |
verbose |
Logical. If |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
Value
A list with seven slots:
- screening: Data frame with selection scores of all the interactions considered.
- selected: Data frame with selection scores of the selected interactions.
- df: Data frame with the computed interactions.
- plot: List of plots of the selected interactions versus the response variable. The output list can be plotted all at once with patchwork::wrap_plots(p) or cowplot::plot_grid(plotlist = p), or one by one by extracting each plot from the list.
- data: Data frame with the response variable, the predictors, and the selected interactions, ready to be used as the data argument in the package functions.
- dependent.variable.name: Character, name of the response.
- predictor.variable.names: Character vector with the names of the predictors and the selected interactions.
Examples
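if(interactive()){

  #load example data
  data(plant_richness_df)

  #xy is not provided, so models with and without interactions are not compared
  new.features <- the_feature_engineer(
    data = plant_richness_df,
    dependent.variable.name = "richness_species_vascular",
    predictor.variable.names = colnames(plant_richness_df)[5:21],
    n.cores = 1,
    verbose = TRUE
  )

  #selection scores of all the interactions considered
  new.features$screening

  #selection scores of the selected interactions
  new.features$selected

  #data frame with the computed interactions
  new.features$df

}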