R: Identify columns that are included in others

which_are_included {dataPreparation}

R Documentation

Identify columns that are included in others

Description

Find all columns that contain no more information than another column. For example, if one column holds an amount and another holds the same amount rounded, the second column is included in the first.
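The notion of inclusion can be sketched in base R. This is only an illustration under the assumption that "included" means "fully determined by the other column"; the helper `is_included` is hypothetical and is not the package's implementation.

```r
# Hypothetical helper (not part of dataPreparation): b is "included" in a
# when every distinct value of a maps to exactly one value of b.
is_included <- function(a, b) {
  nrow(unique(data.frame(a, b))) == length(unique(a))
}

amount  <- c(1.2, 1.7, 2.4, 2.6, 1.2)
rounded <- round(amount)

is_included(amount, rounded)  # the rounded column is determined by the amount
is_included(rounded, amount)  # the amount cannot be recovered from the rounded column
```

The first call returns TRUE and the second FALSE: rounding loses information, so only the rounded column is redundant.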

Usage

which_are_included(data_set, keep_cols = NULL, verbose = TRUE)

Arguments

`data_set`	Matrix, data.frame or data.table
`keep_cols`	List of columns not to drop (list of character, defaults to NULL)
`verbose`	Whether the algorithm should print messages (logical, defaults to TRUE)

Details

This function performs an exponential search and compares every pair of columns.
Be very careful when using this function:
- if there is an id column, every other column will be reported as included in it;
- the order of the columns influences the result.
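Both caveats can be reproduced with a small base R sketch. The helper `is_included` and the toy columns are hypothetical and only assume that "included" means "fully determined by the other column"; the package may proceed differently.

```r
# Hypothetical helper (not part of dataPreparation): b is "included" in a
# when every distinct value of a maps to exactly one value of b.
is_included <- function(a, b) {
  nrow(unique(data.frame(a, b))) == length(unique(a))
}

country_code <- c("FR", "US", "FR", "DE")
country_name <- c("France", "United States", "France", "Germany")
id           <- seq_along(country_code)

# Order caveat: the two columns determine each other, so which one gets
# flagged depends on which one is examined first.
is_included(country_code, country_name)
is_included(country_name, country_code)

# Id caveat: an id is unique per row, so any column is determined by it.
is_included(id, country_name)
is_included(country_name, id)
```

The first three calls return TRUE and the last one FALSE, which is why an id column makes every other column look redundant.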

Last but not least, with some machine learning algorithms it is not always wise to drop columns even if they add no information: the extreme case is an id column, which makes every other column look redundant.

Value

A list of the indexes of the columns that are included in another column of data_set.

Examples

# Load the packages and the toy data set
library(data.table)
library(dataPreparation)
data(tiny_messy_adult)

# Check for included columns
which_are_included(tiny_messy_adult)

# The result also contains columns that are constant, duplicated or bijections
# Let's add a column that is strictly included in another one
tiny_messy_adult$are50OrMore <- tiny_messy_adult$age > 50
which_are_included(tiny_messy_adult[, .(age, are50OrMore)])

# As one can see, this column, which carries no information beyond age, is spotted.

# But be careful: if there is an id column, every other column will be flagged as included:
tiny_messy_adult$id <- seq_len(nrow(tiny_messy_adult)) # build id
which_are_included(tiny_messy_adult)

[Package dataPreparation version 1.1.1 Index]