which_are_in_double {dataPreparation} | R Documentation
Find all the columns that are exact duplicates of another column.
which_are_in_double(data_set, keep_cols = NULL, verbose = TRUE)
data_set | Matrix, data.frame or data.table
keep_cols | List of columns not to drop (list of character, defaults to NULL)
verbose | Whether the function should log what it finds (logical, defaults to TRUE)
This function searches by comparing every pair of columns. It first compares the
first 10 lines of both columns. If they differ, the columns are not identical; otherwise
it compares lines 11 to 100, then 101 to 1000, and so on. This makes the function fast on a
data_set with a large number of lines and many columns that are not equal.
If verbose is TRUE, the columns logged are the ones returned.
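The chunked comparison described above can be sketched in base R as follows. This is an illustrative sketch only, not the package's actual implementation: the helper names cols_identical_sketch and find_duplicate_cols_sketch are hypothetical, the use of identical() for each chunk is an assumption, and the sketch only handles matrices.

```r
# Hypothetical helper (not part of dataPreparation): compare two vectors
# chunk by chunk, with chunks covering rows 1-10, 11-100, 101-1000, ...
cols_identical_sketch <- function(x, y) {
  n <- length(x)
  if (length(y) != n) return(FALSE)
  start <- 1
  end <- min(10, n)
  while (start <= n) {
    # Assumption: identical() is used per chunk, so NA vs NA counts as equal
    # and NA vs a non-NA value counts as different.
    if (!identical(x[start:end], y[start:end])) return(FALSE)
    start <- end + 1
    end <- min(end * 10, n)
  }
  TRUE
}

# Hypothetical helper: indexes of matrix columns that duplicate an earlier column
find_duplicate_cols_sketch <- function(m) {
  p <- ncol(m)
  dup <- integer(0)
  if (p < 2) return(dup)
  for (i in seq_len(p - 1)) {
    if (i %in% dup) next            # already known to be a copy, skip it
    for (j in seq(i + 1, p)) {
      if (j %in% dup) next
      if (cols_identical_sketch(m[, i], m[, j])) dup <- c(dup, j)
    }
  }
  sort(dup)
}

M <- matrix(1, nrow = 1000, ncol = 3)
M[1, 2] <- NA
find_duplicate_cols_sketch(M)  # column 3 duplicates column 1
```

Stopping at the first differing chunk is what keeps the search cheap when most columns differ within their first few lines: only truly similar columns are read in full.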
A list of indexes of the columns that have an exact duplicate in data_set. For example, if columns i and j (with j > i) are equal, j is returned.
# First let's build a matrix with 3 columns and a lot of lines, with 1's everywhere
M <- matrix(1, nrow = 1e6, ncol = 3)
# Now let's check which columns are equal
which_are_in_double(M)
# It returns 2 and 3: you should only keep column 1.
# Let's change column 2, line 1 to 0 and check again
M[1, 2] <- 0
which_are_in_double(M)
# It only returns 3
# What about NA? NA vs not NA => not equal
M[1, 2] <- NA
which_are_in_double(M)
# It only returns 3
# What about NA? NA vs NA => they are considered equal
M[1, 1] <- NA
which_are_in_double(M)
# It only returns 2
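Once the duplicated column indexes are known, the matrix can be reduced to its unique columns. The snippet below is a minimal sketch in base R: dup_idx is a hard-coded stand-in for the result of which_are_in_double(M) on the all-ones matrix (2 and 3, as reported in the first example), and the names dup_idx, keep and M_unique are illustrative.

```r
M <- matrix(1, nrow = 100, ncol = 3)

# Assumed stand-in for which_are_in_double(M); the documented result is 2 and 3
dup_idx <- c(2L, 3L)

# Build the set of columns to keep instead of using negative indexing:
# M[, -dup_idx] would drop every column if dup_idx were empty.
keep <- setdiff(seq_len(ncol(M)), unlist(dup_idx))
M_unique <- M[, keep, drop = FALSE]  # drop = FALSE keeps the matrix shape

dim(M_unique)  # 100 rows, 1 column
```

Using setdiff() keeps the code safe when no duplicate is found, and unlist() lets it accept the indexes whether they come back as a list or as a plain vector.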