preference_order {collinear} | R Documentation |
Compute the preference order for predictors based on a user-defined function.
Description
This function calculates the preference order of predictors based on a user-provided function that takes a predictor, a response, and a data frame as arguments.
Usage
preference_order(
  df = NULL,
  response = NULL,
  predictors = NULL,
  f = f_rsquared,
  encoding_method = "mean",
  workers = 1
)
Arguments
df |
(required; data frame) A data frame with numeric and/or character predictors, and optionally, a response variable. Default: NULL. |
response |
(required; character string) Name of a numeric response variable. Character response variables are ignored. See 'Details' to understand how providing or omitting this argument leads to different results when there are character variables in 'predictors'. Default: NULL. |
predictors |
(optional; character vector) Names of the predictors in 'df'. If omitted, all columns of 'df' are used as predictors. Default: NULL. |
f |
(optional; function) A function that takes a predictor name, a response name, and a data frame, and returns a single number representing the relationship between the predictor and the response. Higher values are ranked higher. Default: f_rsquared. The available options are:
|
encoding_method |
(optional; character string). Name of the target encoding method to convert character and factor predictors to numeric. One of "mean", "rank", "loo", "rnorm" (see |
workers |
(optional; integer) Number of workers for parallel execution. Default: 1. |
Value
A data frame with the columns "predictor" and "value". The former contains the predictor names in order of preference, ready for the argument preference_order in cor_select(), vif_select(), and collinear(). The latter contains the result of the function f for each combination of predictor and response.
Author(s)
Blas M. Benito
Examples
data(
  vi,
  vi_predictors
)
#subset to limit example run time
vi <- vi[1:1000, ]
#computing preference order
#with response
#numeric and categorical predictors in the output
#as the R-squared between each predictor and the response
preference.order <- preference_order(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  f = f_rsquared,
  workers = 1
)
preference.order
#using it in variable selection with cor_select()
selected.predictors <- cor_select(
  df = vi,
  response = "vi_mean", #don't forget the response!
  predictors = vi_predictors,
  preference_order = preference.order,
  max_cor = 0.75
)
selected.predictors
#check their correlations
selected.predictors.cor <- cor_df(
  df = vi,
  response = "vi_mean",
  predictors = selected.predictors
)
#all correlations below max_cor
selected.predictors.cor
#USING A CUSTOM FUNCTION
#custom function to compute RMSE between a predictor and a response
#x is a predictor name
#y is a response name
#df is a data frame with multiple predictors and one response
#must return a single number, with higher number indicating higher preference
#notice we use "one minus RMSE" to give higher rank to variables with lower RMSE
f_rmse <- function(x, y, df){
  xy <- df[, c(x, y)] |>
    na.omit() |>
    scale()
  1 - sqrt(mean((xy[, 1] - xy[, 2])^2))
}
preference.order <- preference_order(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  f = f_rmse,
  workers = 1
)
preference.order
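#ANOTHER CUSTOM FUNCTION (ILLUSTRATIVE SKETCH)
#f_abs_spearman is a hypothetical name, not a function shipped with the package
#it follows the same contract as f_rmse above:
#x is a predictor name, y is a response name, df is a data frame
#it returns the absolute Spearman correlation between predictor and response,
#so higher values already mean higher preference and no "one minus" trick is needed
f_abs_spearman <- function(x, y, df){
  xy <- na.omit(df[, c(x, y)])
  abs(stats::cor(xy[[1]], xy[[2]], method = "spearman"))
}
#quick check on a toy data frame (values made up for illustration)
#before passing the function to preference_order() via f = f_abs_spearman
toy <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  a = c(2, 4, 6, 8, 10, 12),
  b = c(3, 1, 6, 2, 5, 4)
)
#"a" increases monotonically with "y", so it should score higher than "b"
f_abs_spearman(x = "a", y = "y", df = toy)
f_abs_spearman(x = "b", y = "y", df = toy)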