cor_select {collinear} | R Documentation |
Automated multicollinearity reduction via pairwise correlation
Description
Applies a recursive forward selection algorithm algorithm to select predictors with a bivariate correlation with any other predictor lower than a threshold defined by the argument max_cor
.
If the argument response
is provided, all non-numeric variables in predictors
are transformed into numeric using target encoding (see target_encoding_lab()
). Otherwise, non-numeric variables are ignored.
The argument preference_order
allows defining a preference selection order to preserve (when possible) variables that might be interesting or even required for a given analysis. If NULL, predictors are ordered from lower to higher sum of their absolute pairwise correlation with the other predictors.
For example, if predictors
is c("a", "b", "c")
and preference_order
is c("a", "b")
, there are two possibilities:
If the correlation between
"a"
and"b"
is belowmax_cor
, both variables are selected.If their correlation is equal or above
max_cor
, then"a"
is selected, no matter its correlation with"c"
,
If preference_order
is not provided, then the predictors are ranked by their variance inflation factor as computed by vif_df()
.
Usage
cor_select(
df = NULL,
response = NULL,
predictors = NULL,
preference_order = NULL,
cor_method = "pearson",
max_cor = 0.75,
encoding_method = "mean"
)
Arguments
df |
(required; data frame) A data frame with numeric and/or character predictors predictors, and optionally, a response variable. Default: NULL. |
response |
(recommended, character string) Name of a numeric response variable. Character response variables are ignored. Please, see 'Details' to better understand how providing this argument or not leads to different results when there are character variables in 'predictors'. Default: NULL. |
predictors |
(optional; character vector) Character vector with predictor names in 'df'. If omitted, all columns of 'df' are used as predictors. Default:'NULL' |
preference_order |
(optional; character vector) vector with column names in 'predictors' in the desired preference order, or result of the function |
cor_method |
(optional; character string) Method used to compute pairwise correlations. Accepted methods are "pearson" (with a recommended minimum of 30 rows in 'df') or "spearman" (with a recommended minimum of 10 rows in 'df'). Default: "pearson". |
max_cor |
(optional; numeric) Maximum correlation allowed between any pair of predictors. Higher values return larger number of predictors with higher multicollinearity. Default: 0.75 |
encoding_method |
(optional; character string). Name of the target encoding method to convert character and factor predictors to numeric. One of "mean", "rank", "loo", "rnorm" (see |
Value
Character vector with the names of the selected predictors.
Author(s)
Blas M. Benito
Examples
data(
vi,
vi_predictors
)
#subset to limit example run time
vi <- vi[1:1000, ]
vi_predictors <- vi_predictors[1:10]
#without response
#without preference_order
#permissive max_cor
selected.predictors <- cor_select(
df = vi,
predictors = vi_predictors,
max_cor = 0.8
)
selected.predictors
#without response
#without preference_order
#restrictive max_cor
selected.predictors <- cor_select(
df = vi,
predictors = vi_predictors,
max_cor = 0.5
)
selected.predictors
#with response
#without preference_order
#restrictive max_cor
#slightly different solution than previous one
#because here target encoding is done against the response
#while before was done pairwise against each numeric predictor
selected.predictors <- cor_select(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
max_cor = 0.5
)
selected.predictors
#with response
#with user-defined preference_order
#restrictive max_cor
#numerics and categorical variables in output
selected.predictors <- cor_select(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
preference_order = c(
"soil_type", #categorical variable
"soil_temperature_mean",
"swi_mean",
"rainfall_mean",
"evapotranspiration_mean"
),
max_cor = 0.5
)
selected.predictors
#with response
#with automated preference_order
#restrictive max_cor and max_vif
#numerics and categorical variables in output
preference.order <- preference_order(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
f = f_rsquared #cor(response, predictor)
)
head(preference.order)
selected.predictors <- cor_select(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
preference_order = preference.order,
max_cor = 0.5
)
selected.predictors