R: Correlation table

corr {lares}

R Documentation

Correlation table

Description

This function correlates a whole dataframe, running one hot smart encoding (`ohse()`) to transform non-numerical features. Note that it will automatically drop columns with fewer than 3 non-missing values and warn the user.

Usage

corr(
  df,
  method = "pearson",
  use = "pairwise.complete.obs",
  pvalue = FALSE,
  padjust = NULL,
  half = FALSE,
  dec = 6,
  ignore = NULL,
  dummy = TRUE,
  redundant = NULL,
  logs = FALSE,
  limit = 10,
  top = NA,
  ...
)

Arguments

`df`	Dataframe. Non-numerical columns are allowed: they are one hot encoded when `dummy = TRUE` and filtered out otherwise.
`method`	Character. Any of: c("pearson", "kendall", "spearman").
`use`	Character. Method for computing covariances in the presence of missing values. Check `stats::cor` for options.
`pvalue`	Boolean. If `TRUE`, returns a list with the correlations and the statistical significance (p-value) of each one.
`padjust`	Character. `NULL` to skip, or any of `p.adjust.methods` to calculate adjusted p-values for multiple comparisons using `p.adjust()`.
`half`	Boolean. Return only half of the matrix? The redundant symmetrical correlations will be `NA`.
`dec`	Integer. Number of decimals to round correlations and p-values.
`ignore`	Character vector. Which column(s) should be ignored?
`dummy`	Boolean. Should One Hot (Smart) Encoding (`ohse()`) be applied to categorical columns?
`redundant`	Boolean. Should we keep redundant columns? For example, if a column has only two different values, should we keep both new columns? If set to `NULL`, only binary variables will drop redundant columns.
`logs`	Boolean. Calculate log(x)+1 for numerical columns?
`limit`	Integer. Limit one hot encoding to the n most frequent values of each column. Set to `NA` to ignore argument.
`top`	Integer. Select top N most relevant variables? Filtered and sorted by mean of each variable's correlations.
`...`	Additional parameters passed to `ohse`, `corr`, and/or `cor.test`.
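
The `method`, `use`, `pvalue`, and `padjust` arguments correspond to standard base R calls. The sketch below uses only `stats::cor()`, `stats::cor.test()`, and `stats::p.adjust()` on a made-up data frame (`toy` and its columns are illustrative assumptions). It shows the kind of computation these arguments control, not the internals of `corr()`.

```r
# Toy data with one missing value (illustrative, not part of lares)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 4, 5, 4, 5, 7),
  c = c(6, NA, 5, 3, 2, 2)
)

# Pairwise-complete Pearson correlations: each pair of columns uses
# only the rows where both values are present
r <- cor(toy, method = "pearson", use = "pairwise.complete.obs")

# Raw p-value for each distinct pair of columns via cor.test()
pairs <- combn(names(toy), 2, simplify = FALSE)
p_raw <- vapply(
  pairs,
  function(p) cor.test(toy[[p[1]]], toy[[p[2]]], method = "pearson")$p.value,
  numeric(1)
)
names(p_raw) <- vapply(pairs, paste, character(1), collapse = "~")

# Adjust for multiple comparisons; any value in p.adjust.methods is valid
p_adj <- p.adjust(p_raw, method = "holm")

print(round(r, 6))
print(round(cbind(p_raw, p_adj), 6))
```

Rounding to 6 decimals here mirrors the default `dec = 6`.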

Value

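data.frame. Square (N x N) table holding the correlation between every pair of columns/variables in `df`. Note that when `ohse()` is applied, the result may have more dimensions than `df` has columns. When `pvalue = TRUE`, a list with correlations and p-values is returned.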
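
To see why encoding increases the dimensions, and what a half matrix looks like, here is a minimal base R sketch. It uses `model.matrix()` as a stand-in for one hot encoding (`ohse()` may name or select columns differently), and masking the upper triangle is an arbitrary choice for illustration; which half `corr()` sets to `NA` is not specified here.

```r
# Toy data: two numeric columns and one 3-level factor (illustrative)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 1, 4, 3, 5),
  g = factor(c("a", "b", "a", "c", "b"))
)

# Expand the factor into one 0/1 indicator column per level
dummies <- model.matrix(~ g - 1, data = toy)
num <- cbind(toy[, c("x", "y")], dummies)

# Two numeric columns plus three indicator columns give a 5 x 5 matrix
r <- cor(num)
print(dim(r))

# Keep one triangle only: the symmetric duplicates become NA
half <- r
half[upper.tri(half)] <- NA
print(round(half, 6))
```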

Examples

data(dft) # Titanic dataset
df <- dft[, 2:5]

# Correlation matrix (without redundancy)
corr(df, half = TRUE)

# Ignore specific column
corr(df, ignore = "Pclass")

# Calculate p-values as well
corr(df, pvalue = TRUE, limit = 1)

# Test a column with fewer than 3 non-missing values
df$trash <- c(1, rep(NA, nrow(df) - 1))
# and another method...
corr(df, method = "spearman")