which_are_included {dataPreparation}    R Documentation

Identify columns that are included in others

Description

Find all the columns that don't contain more information than another column. For example, if one column holds an amount and another holds the same amount rounded, the second column is included in the first.
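The notion of inclusion can be illustrated in base R. This is a minimal sketch, not the package's implementation: `is_included` is a hypothetical helper that treats `b` as included in `a` when every distinct value of `a` is paired with exactly one value of `b`.

```r
# Hypothetical helper (not part of dataPreparation): TRUE when every value
# of `a` is paired with exactly one value of `b`, i.e. `b` adds no information.
is_included <- function(a, b) {
  nrow(unique(data.frame(a = a, b = b))) == length(unique(a))
}

amount  <- c(1.2, 1.4, 2.6, 2.6)
rounded <- round(amount)

rounded_in_amount <- is_included(amount, rounded)  # TRUE: amount determines rounded
amount_in_rounded <- is_included(rounded, amount)  # FALSE: 1 pairs with 1.2 and 1.4
```

The relation is not symmetric: the rounded column can be rebuilt from the amount, but not the other way around.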

Usage

which_are_included(data_set, keep_cols = NULL, verbose = TRUE)

Arguments

data_set

Matrix, data.frame or data.table

keep_cols

List of columns not to drop (list of character, defaults to NULL)

verbose

Should the algorithm talk? (logical, defaults to TRUE)

Details

This function examines every pair of columns, so it can be slow on data sets with many columns.
Be very careful when using this function:
- if there is an id column, every other column will be reported as included in it;
- the order of the columns influences the result.
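Both caveats can be reproduced with a naive pairwise scan in base R. This is only a sketch under stated assumptions: `is_included` and `naive_included` are hypothetical helpers written for illustration, and the package's actual algorithm may differ.

```r
# Hypothetical helper: TRUE when column `b` is fully determined by column `a`.
is_included <- function(a, b) {
  nrow(unique(data.frame(a = a, b = b))) == length(unique(a))
}

# Naive pairwise scan, for illustration only: visit columns in order and
# flag column j when some not-yet-flagged column i determines it.
naive_included <- function(df) {
  flagged <- integer(0)
  for (i in seq_along(df)) {
    if (i %in% flagged) next
    for (j in seq_along(df)) {
      if (j == i || j %in% flagged) next
      if (is_included(df[[i]], df[[j]])) flagged <- c(flagged, j)
    }
  }
  sort(flagged)
}

df <- data.frame(
  amount  = c(1.2, 1.4, 2.6, 2.6),
  rounded = c(1, 1, 3, 3),
  code    = c("a", "b", "c", "c")  # one-to-one with amount
)

# Order matters: amount and code determine each other, so whichever
# comes first is kept and the other is flagged.
flagged_default   <- names(df)[naive_included(df)]
reordered         <- df[, c("code", "rounded", "amount")]
flagged_reordered <- names(reordered)[naive_included(reordered)]

# An id column determines every other column, so all of them get flagged.
with_id         <- cbind(df, id = seq_len(nrow(df)))
flagged_with_id <- names(with_id)[naive_included(with_id)]
```

In this sketch `flagged_default` is `rounded` and `code`, `flagged_reordered` is `rounded` and `amount`, and `flagged_with_id` contains every column except `id`.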

Last but not least, with some machine learning algorithms it is not always wise to drop columns even if they add no information. The id column is the extreme example: every other column is included in it, yet dropping them all is clearly not what you want.

Value

A list of indexes of the columns that are included in another column of data_set.

Examples

# Load the packages and the toy data set
library(data.table)
library(dataPreparation)
data(messy_adult)

# Reduce the set to 100 rows to save time (you can also run it on the full set)
messy_adult <- messy_adult[seq_len(100), ]

# Check for included columns
which_are_included(messy_adult)

# The result also includes columns that are constant, duplicated, or in bijection with another column
# Let's add a column that is strictly included in another one
messy_adult$are50OrMore <- messy_adult$age > 50
which_are_included(messy_adult[, .(age, are50OrMore)])

# As one can see, this column, which carries no information beyond age, is spotted.

# But be careful: if there is an id column, every other column will be reported as included in it:
messy_adult$id <- seq_len(nrow(messy_adult)) # build id
which_are_included(messy_adult)

[Package dataPreparation version 1.0.4 Index]