R: Identify double columns

which_are_in_double {dataPreparation}

R Documentation

Identify double columns

Description

Find all the columns that are exact duplicates of another column.

Usage

which_are_in_double(data_set, keep_cols = NULL, verbose = TRUE)

Arguments

`data_set`	Matrix, data.frame or data.table
`keep_cols`	List of columns not to drop (list of character, defaults to NULL)
`verbose`	Should the function log what it finds? (logical, defaults to TRUE)

Details

This function searches by comparing every pair of columns. It first compares the first 10 rows of both columns. If they differ, the columns are not identical; otherwise it compares rows 11 to 100, then rows 101 to 1000, and so on. The function is therefore fast on a data_set with a large number of rows and many columns that are not equal.
If verbose is TRUE, the columns logged are the ones returned.
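
The chunked comparison described above can be sketched in base R as follows. This is only an illustration, not the package's actual implementation: the helper names `chunked_identical` and `find_duplicate_columns` are hypothetical, the sketch assumes a matrix or data.frame input, and it relies on `identical()`, which also requires the two columns to have the same type.

```r
# Hypothetical helper (not part of dataPreparation): TRUE if two vectors of
# equal length are identical, checking rows 1-10, then 11-100, 101-1000, ...
chunked_identical <- function(x, y) {
  n <- length(x)
  if (n == 0L) return(TRUE)
  start <- 1
  end <- min(10, n)
  repeat {
    rows <- start:end
    # identical() treats NA as equal to NA and as different from any other value
    if (!identical(x[rows], y[rows])) {
      return(FALSE)  # early exit: one differing chunk settles the question
    }
    if (end >= n) {
      return(TRUE)   # every chunk matched
    }
    start <- end + 1
    end <- min(end * 10, n)
  }
}

# Hypothetical helper: indices of the columns that duplicate an earlier column
find_duplicate_columns <- function(m) {
  dups <- integer(0)
  for (j in seq_len(ncol(m))[-1]) {
    for (i in seq_len(j - 1)) {
      # Columns already flagged equal some earlier column, so skip them
      if (i %in% dups) next
      if (chunked_identical(m[, i], m[, j])) {
        dups <- c(dups, j)
        break
      }
    }
  }
  dups
}

# Small demonstration: column 3 copies column 1, column 2 differs in row 1
demo <- matrix(1, nrow = 1000, ncol = 3)
demo[1, 2] <- 0
find_duplicate_columns(demo)
```

Because most non-identical columns already differ within their first few rows, the early exit means that most pairs are settled after comparing only 10 values.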

Value

A list of indices of the columns that have an exact duplicate in data_set. Ex: if column i and column j (with j > i) are equal, it returns j.
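
The returned indices can then be used to drop the duplicated columns. The sketch below does not call the package: `dup_idx` is an assumed result, hard-coded to the values reported in the first example of this page, and `keep_idx` and `M_unique` are illustrative names.

```r
# Toy data: columns 2 and 3 duplicate column 1
M <- matrix(1, nrow = 5, ncol = 3)

# Assumed result of which_are_in_double(M), hard-coded for illustration
dup_idx <- c(2L, 3L)

# setdiff() stays correct when dup_idx is empty, whereas M[, -dup_idx]
# would select zero columns in that case
keep_idx <- setdiff(seq_len(ncol(M)), unlist(dup_idx))

# drop = FALSE keeps the matrix shape even when a single column remains
M_unique <- M[, keep_idx, drop = FALSE]
dim(M_unique)
# 5 1
```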

Examples

# First let's build a matrix with 3 columns and a lot of rows, with 1's everywhere
M <- matrix(1, nrow = 1e6, ncol = 3)

# Now let's check which columns are equal
which_are_in_double(M)
# It returns 2 and 3: you should only keep column 1.

# Let's change column 2, row 1 to 0 and check again
M[1, 2] <- 0
which_are_in_double(M)
# It only returns 3

# What about NA? NA vs not NA => not equal
M[1, 2] <- NA
which_are_in_double(M)
# It only returns 3

# What about NA? NA vs NA => they are considered equal
M[1, 1] <- NA
which_are_in_double(M)
# It only returns 2

[Package dataPreparation version 1.1.1 Index]