R: Correlation data frame of numeric and character variables

cor_df {collinear}

R Documentation

Correlation data frame of numeric and character variables

Description

Returns a data frame with the correlation between all pairs of predictors in a training dataset. Non-numeric predictors are converted to numeric via target encoding, using the 'response' variable as reference when it is provided (see 'Details' for the case without a response).

Usage

cor_df(
  df = NULL,
  response = NULL,
  predictors = NULL,
  cor_method = "pearson",
  encoding_method = "mean"
)

Arguments

`df`	(required; data frame) A data frame with numeric and/or character predictors, and optionally, a response variable. Default: NULL.
`response`	(recommended; character string) Name of a numeric response variable. Character response variables are ignored. See 'Details' for how providing or omitting this argument changes the results when 'predictors' contains character variables. Default: NULL.
`predictors`	(optional; character vector) Names of the predictors in 'df'. If omitted, all columns of 'df' are used as predictors. Default: NULL.
`cor_method`	(optional; character string) Method used to compute pairwise correlations. Accepted methods are "pearson" (with a recommended minimum of 30 rows in 'df') or "spearman" (with a recommended minimum of 10 rows in 'df'). Default: "pearson".
`encoding_method`	(optional; character string) Name of the target encoding method used to convert character and factor predictors to numeric. One of "mean", "rank", "loo", "rnorm" (see `target_encoding_lab()` for further details). Default: "mean".
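
To illustrate what the choice of 'cor_method' implies, the sketch below calls stats::cor() directly on toy data in which 'y' increases monotonically but non-linearly with 'x'. The vectors are invented for illustration and are not part of the package.

```r
# Toy data (hypothetical): y grows monotonically but non-linearly with x
x <- 1:30
y <- exp(x / 5)

# Pearson measures linear association
pearson <- stats::cor(x, y, method = "pearson")

# Spearman measures monotonic association (Pearson computed on ranks)
spearman <- stats::cor(x, y, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

Spearman reaches 1 here because the ordering of 'x' and 'y' is identical, whereas Pearson is lower because the relationship is not linear.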

Details

This function attempts to handle correlations between pairs of variables that can be of different types:

numeric vs. numeric: computed with stats::cor() with the methods "pearson" or "spearman".
numeric vs. character, two alternatives leading to different results:
- 'response' is provided: the character variable is target-encoded as numeric using the values of the response as reference, and then its correlation with the numeric variable is computed with stats::cor(). This option generates a response-specific result suitable for training statistical and machine-learning models.
- 'response' is NULL (or the name of a non-numeric column): the character variable is target-encoded as numeric using the values of the numeric predictor (instead of the response) as reference, and then their correlation is computed with stats::cor(). This option leads to a response-agnostic result suitable for clustering problems.
character vs. character, two alternatives leading to different results:
- 'response' is provided: the character variables are target-encoded as numeric using the values of the response as reference, and then their correlation is computed with stats::cor().
- 'response' is NULL (or the name of a non-numeric column): the association between the character variables is computed using Cramer's V. This option might be problematic because R-squared values and Cramer's V, although both range between 0 and 1, are not fully comparable.
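
The cases above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's internal code: the toy data, the helper names `encode_mean()` and `cramers_v()`, and the textbook (uncorrected) Cramer's V formula are all hypothetical choices made for this example.

```r
# Toy data (hypothetical, not from the package)
toy <- data.frame(
  response = c(1, 2, 10, 11, 5, 6),
  num      = c(5, 1, 3, 2, 6, 4),
  chr_a    = c("a", "a", "b", "b", "c", "c"),
  chr_b    = c("u", "u", "u", "v", "v", "v"),
  stringsAsFactors = FALSE
)

# Mean target encoding: each category is replaced by the mean of a
# reference variable within that category (hypothetical helper)
encode_mean <- function(categories, reference) {
  stats::ave(reference, categories, FUN = mean)
}

# numeric vs. character, 'response' provided:
# the character variable is encoded against the response
cor_with_response <- stats::cor(
  toy$num,
  encode_mean(toy$chr_a, toy$response)
)

# numeric vs. character, 'response' is NULL:
# the character variable is encoded against the numeric predictor in the pair
cor_without_response <- stats::cor(
  toy$num,
  encode_mean(toy$chr_a, toy$num)
)

# character vs. character, 'response' is NULL:
# textbook Cramer's V from the chi-squared statistic (assumed formula)
cramers_v <- function(a, b) {
  tab <- table(a, b)
  chi2 <- suppressWarnings(stats::chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}
v <- cramers_v(toy$chr_a, toy$chr_b)

print(c(
  with_response = cor_with_response,
  without_response = cor_without_response,
  cramers_v = v
))
```

With these toy values the two encodings give correlations of opposite sign, which shows why results with and without 'response' should not be mixed.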

Value

Data frame with one row per pair of predictors and their correlation.

Author(s)

Blas M. Benito

Examples


library(collinear)

data(
  vi,
  vi_predictors
)

#reduce size of vi to speed up example execution
vi <- vi[1:1000, ]
vi_predictors <- vi_predictors[1:10]

#without response
#categorical vs categorical compared with cramer_v()
#categorical vs numerical compared with stats::cor() via target encoding
#numerical vs numerical compared with stats::cor()
df <- cor_df(
  df = vi,
  predictors = vi_predictors
)

head(df)

#with response
#different solution than previous one
#because target encoding is done against the response
#rather than against the other numeric in the pair
df <- cor_df(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors
)

head(df)
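
As a further variation on the calls above, the non-default options documented under 'Arguments' can be combined. The sketch below is guarded so that it only runs when the 'collinear' package is installed; the object name `df_rank` is chosen here for illustration.

```r
# Assumes the 'collinear' package is installed; skipped otherwise
if (requireNamespace("collinear", quietly = TRUE)) {

  data(vi, vi_predictors, package = "collinear")

  # same subsets as in the examples above
  vi <- vi[1:1000, ]
  vi_predictors <- vi_predictors[1:10]

  # rank-based target encoding combined with Spearman correlation
  df_rank <- collinear::cor_df(
    df = vi,
    response = "vi_mean",
    predictors = vi_predictors,
    cor_method = "spearman",
    encoding_method = "rank"
  )

  print(head(df_rank))
}
```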

[Package collinear version 1.1.1 Index]