which_are_included {dataPreparation} R Documentation

## Identify columns that are included in others

### Description

Find all the columns that don't contain more information than another column. For example, if one column holds an amount and another holds the same amount rounded, the second column is included in the first.
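The notion of inclusion can be illustrated with a minimal base R sketch. The helper `is_included` below is hypothetical: it is not part of dataPreparation and does not claim to reproduce how the package performs the check. It assumes that a column `b` is included in a column `a` when every value of `a` maps to exactly one value of `b`.

```r
# Hypothetical helper, not part of dataPreparation: TRUE when `b` carries
# no information beyond `a`, i.e. each value of `a` maps to one value of `b`.
is_included <- function(b, a) {
  pairs <- unique(data.frame(a = a, b = b))
  nrow(pairs) == length(unique(a))
}

amount  <- c(10.2, 10.4, 11.7, 11.7, 12.9)
rounded <- round(amount)

# rounded is fully determined by amount
rounded_in_amount <- is_included(rounded, a = amount)

# amount is not determined by rounded: 10.2 and 10.4 both round to 10
amount_in_rounded <- is_included(amount, a = rounded)

print(c(rounded_in_amount = rounded_in_amount,
        amount_in_rounded = amount_in_rounded))
```

Under this assumption the rounded column is included in the exact one, but not the other way round.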

### Usage

which_are_included(data_set, keep_cols = NULL, verbose = TRUE)


### Arguments

- data_set: Matrix, data.frame or data.table
- keep_cols: List of columns not to drop (list of character, default to NULL)
- verbose: Should the algorithm talk (logical, default to TRUE)

### Details

This function performs an exponential search and compares every pair of columns.
Be very careful when using this function:
- if there is an id column, every other column will be reported as included in it;
- the order of the columns influences the result.
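Both caveats can be reproduced on a toy table. This is a sketch under stated assumptions: the helper `is_included` and the data frame `toy` are hypothetical, written in base R, and only mimic the idea of inclusion (each value of `a` maps to a single value of `b`) rather than the package's own code.

```r
# Hypothetical helper (an assumption, not the package's code): `b` is
# included in `a` when every value of `a` maps to a single value of `b`.
is_included <- function(b, a) {
  nrow(unique(data.frame(a = a, b = b))) == length(unique(a))
}

toy <- data.frame(
  id     = 1:6,
  colour = c("red", "red", "blue", "blue", "green", "green"),
  code   = c("R", "R", "B", "B", "G", "G"),
  stringsAsFactors = FALSE
)

# Caveat 1: an id column is unique per row, so it determines every column.
in_id <- sapply(toy[c("colour", "code")], is_included, a = toy$id)
print(in_id)

# Caveat 2: colour and code are in bijection, so each is included in the
# other; which one gets flagged depends on which is looked at first.
code_in_colour <- is_included(toy$code, a = toy$colour)
colour_in_code <- is_included(toy$colour, a = toy$code)
print(c(code_in_colour = code_in_colour, colour_in_code = colour_in_code))
```

One way to limit both effects is to protect the columns you care about with keep_cols before dropping anything.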

Last but not least, with some machine learning algorithms it is not always wise to drop columns even if they add no information: the extreme case is the id column, in which every other column is included.

### Value

A list of indexes of the columns that are included in another column of the data_set.

### Examples

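# Load the package and the toy data set
require(dataPreparation)
require(data.table)
data(messy_adult)

# Reduce set size to save time (you can run it on full set);
# the 200-row cut-off is an arbitrary choice
messy_adult <- messy_adult[1:200, ]

# Check for included columns
which_are_included(messy_adult)

# The result above also contains columns that are constant, duplicated or in bijection
# Let's add a column that is strictly included in another one
messy_adult$are50OrMore <- messy_adult$age > 50
which_are_included(messy_adult[, .(age, are50OrMore)])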