R: Standard deviation outlier filtering

remove_sd_outlier {dataPreparation}

R Documentation

Standard deviation outlier filtering

Description

Remove outliers based on standard deviation thresholds.
Only values within mean - sd * n_sigmas and mean + sd * n_sigmas are kept.
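
As a rough illustration of the rule above, the bounds for a single numeric vector could be computed as in the following base R sketch. This is not the package's internal code: the names `x`, `lower`, `upper` and `kept` are hypothetical, and treating the bounds as inclusive is an assumption.

```r
# Illustrative sketch only: applies the mean +/- n_sigmas * sd rule
# to a plain numeric vector (not the package implementation).
x <- c(rep(10, 20), 100)   # 20 ordinary values and one extreme value
n_sigmas <- 3

lower <- mean(x) - n_sigmas * sd(x)
upper <- mean(x) + n_sigmas * sd(x)

# Assumption: values exactly on a bound are kept.
kept <- x[x >= lower & x <= upper]
length(kept)
```

Note that the extreme value itself inflates both the mean and the standard deviation, so a single large outlier widens the interval it is tested against.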

Usage

remove_sd_outlier(data_set, cols = "auto", n_sigmas = 3, verbose = TRUE)

Arguments

`data_set`	Matrix, data.frame or data.table
`cols`	Name(s) of the numeric column(s) of data_set to transform. To transform all numeric columns, set it to "auto". (character, defaults to "auto")
`n_sigmas`	Number of standard deviations from the mean beyond which a value is treated as an outlier (integer, defaults to 3)
`verbose`	Should the algorithm talk? (logical, defaults to TRUE)

Details

Filtering is done column by column: extreme values from the first element of cols are removed, then extreme values from the second element of cols are removed, and so on.
So if filtering is performed on too many columns, there is a high risk that many rows will be dropped.
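
The cumulative effect described above can be sketched in base R on a plain data.frame. This is a hedged illustration, not the package's code: the names `df`, `keep` and `rows_left` are hypothetical, and recomputing the mean and standard deviation on the rows that remain after each column is an assumption about how sequential filtering behaves.

```r
# Illustrative sketch only: sequential, column-by-column filtering.
# Assumption: bounds for each column are computed on the rows that
# survived the filtering of the previous columns.
set.seed(1)
df <- data.frame(a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
n_sigmas <- 2
rows_left <- nrow(df)

for (col in names(df)) {
  m <- mean(df[[col]])
  s <- sd(df[[col]])
  keep <- df[[col]] >= m - n_sigmas * s & df[[col]] <= m + n_sigmas * s
  df <- df[keep, , drop = FALSE]
  rows_left <- c(rows_left, nrow(df))
}

rows_left  # row count before filtering, then after each column
```

Each additional column removes a further share of the remaining rows, which is why filtering on many columns can shrink the dataset considerably.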

Value

Same dataset with fewer rows, edited by reference.
If you don't want to edit by reference, please provide data_set = copy(data_set).

Examples

# Given
library(data.table)
col_vals <- runif(1000)
col_mean <- mean(col_vals)
col_sd <- sd(col_vals)
extreme_val <- col_mean + 6 * col_sd
data_set <- data.table(num_col = c(col_vals, extreme_val))

# When
data_set <- remove_sd_outlier(data_set, cols = "auto", n_sigmas = 3, verbose = TRUE)

# Then extreme value is no longer in set
extreme_val %in% data_set[["num_col"]] # Is false

[Package dataPreparation version 1.1.1 Index]