corr {lares} | R Documentation |
Correlation table
Description
This function correlates a whole dataframe, running one hot smart encoding (ohse) to transform non-numerical features.
Note that it will automatically suppress columns with fewer than 3 non-missing values and warn the user.
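The column-suppression rule can be pictured with a few lines of base R. This is only a sketch on made-up data, not the package's own code: the data frame `toy` and the variable names are hypothetical, and `message()` stands in for whatever warning corr() actually emits.

```r
# Toy data (hypothetical): `trash` has a single non-missing value
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(4, 3, 2, 1),
  trash = c(1, NA, NA, NA)
)

# Count the non-missing values in each column
non_missing <- colSums(!is.na(toy))

# Columns with fewer than 3 non-missing values are reported and dropped
drop <- names(non_missing)[non_missing < 3]
if (length(drop) > 0) {
  message("Dropping columns: ", paste(drop, collapse = ", "))
}
kept <- toy[, non_missing >= 3, drop = FALSE]

print(names(kept))
```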
Usage
corr(
df,
method = "pearson",
use = "pairwise.complete.obs",
pvalue = FALSE,
padjust = NULL,
half = FALSE,
dec = 6,
ignore = NULL,
dummy = TRUE,
redundant = NULL,
logs = FALSE,
limit = 10,
top = NA,
...
)
Arguments
df |
Dataframe. It may contain non-numerical columns: they will be filtered. |
method |
Character. Any of: c("pearson", "kendall", "spearman"). |
use |
Character. Method for computing covariances in the presence
of missing values. Check |
pvalue |
Boolean. If TRUE, returns a list with correlations and statistical significance (p-value) for each value. |
padjust |
Character. NULL to skip or any of |
half |
Boolean. Return only half of the matrix? The redundant
symmetrical correlations will be |
dec |
Integer. Number of decimals to round correlations and p-values. |
ignore |
Vector or character. Which column(s) should be ignored? |
dummy |
Boolean. Should One Hot (Smart) Encoding ( |
redundant |
Boolean. Should we keep redundant columns? i.e. If the
column only has two different values, should we keep both new columns?
Is set to |
logs |
Boolean. Calculate log(x)+1 for numerical columns? |
limit |
Integer. Limit one hot encoding to the n most frequent
values of each column. Set to |
top |
Integer. Select top N most relevant variables? Filtered and sorted by mean of each variable's correlations. |
... |
Additional parameters passed to |
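The method, use, pvalue, padjust and dec arguments correspond to ideas that also exist in base R's stats package. The sketch below does not call corr(); it uses cor(), cor.test() and p.adjust() on a small made-up data frame to show what pairwise-complete correlations and adjusted p-values look like. Whether corr() relies on these exact calls internally is an assumption, and the `toy` data and the choice of "holm" are purely illustrative.

```r
# Toy data (hypothetical values) with one missing entry in column `c`
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(6, 5, NA, 3, 1, 2)
)
method <- "pearson"

# Pairwise-complete correlations: each pair of columns uses every row
# in which both values are present
cors <- cor(toy, method = method, use = "pairwise.complete.obs")

# One p-value per distinct pair of columns; cor.test() drops
# incomplete pairs on its own
pairs <- combn(names(toy), 2, simplify = FALSE)
pvals <- vapply(
  pairs,
  function(p) cor.test(toy[[p[1]]], toy[[p[2]]], method = method)$p.value,
  numeric(1)
)
names(pvals) <- vapply(pairs, paste, character(1), collapse = "-")

# Adjust for multiple comparisons; "holm" is one of p.adjust.methods
padj <- p.adjust(pvals, method = "holm")

# Round to 6 decimals, matching the default shown for `dec`
print(round(cors, 6))
print(round(padj, 6))
```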
Value
data.frame. Square dimensions (N x N), holding the correlation between every pair of columns/variables in df. Notice that when using ohse() you may get more dimensions.
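To see why encoding can add dimensions, here is a minimal base R sketch. It uses model.matrix() as a stand-in for ohse(), which is an assumption for illustration and not how the package encodes columns; the `toy` data frame is made up. The last step blanks out one triangle with NA to mimic the idea behind half = TRUE; how corr() itself marks the dropped half is not stated here.

```r
# Toy data (hypothetical): one numeric and one 3-level categorical column
toy <- data.frame(
  age = c(22, 38, 26, 35, 54, 2),
  port = factor(c("S", "C", "S", "Q", "C", "S"))
)

# One-hot encode `port` with base R; the "- 1" drops the intercept so
# every level gets its own 0/1 column
onehot <- model.matrix(~ port - 1, data = toy)
encoded <- cbind(age = toy$age, onehot)

# 2 input columns have become 4 numeric columns, so the matrix is 4 x 4
m <- round(cor(encoded), 6)
print(dim(m))

# Keep only the lower triangle (assumption: NA marks the dropped cells)
m_half <- m
m_half[upper.tri(m_half, diag = TRUE)] <- NA
print(m_half)
```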
See Also
Other Calculus: dist2d(), model_metrics(), quants()
Other Correlations: corr_cross(), corr_var()
Examples
data(dft) # Titanic dataset
df <- dft[, 2:5]
# Correlation matrix (without redundancy)
corr(df, half = TRUE)
# Ignore specific column
corr(df, ignore = "Pclass")
# Calculate p-values as well
corr(df, pvalue = TRUE, limit = 1)
# Test when no more than 2 non-missing values
df$trash <- c(1, rep(NA, nrow(df) - 1))
# and another method...
corr(df, method = "spearman")