DDC {cellWise}    R Documentation

Detect Deviating Cells

Description

This function aims to detect cellwise outliers in the data. These are entries in the data matrix that are substantially higher or lower than what could be expected based on the other cells in the same column and the other cells in the same row, taking the relations between the columns into account. Note that this function first calls checkDataSet and then analyzes the remaining cleaned data.
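
To make the idea of a deviating cell concrete, the following base R sketch flags cells using only their own column. It is not the DDC algorithm, which additionally predicts each cell from correlated columns. The median/MAD standardization and the chi-square based cutoff are assumptions made for this illustration.

```r
# Columnwise sketch only: this is NOT the DDC algorithm, which also
# uses correlated columns to predict each cell.
set.seed(1)
n <- 100; d <- 4
X <- matrix(rnorm(n * d), n, d)
X[5, 2]  <- 8    # plant two deviating cells
X[40, 3] <- -9

# Robustly standardize each column: median for location, MAD for scale.
loc <- apply(X, 2, median, na.rm = TRUE)
scl <- apply(X, 2, mad, na.rm = TRUE)
Z <- sweep(sweep(X, 2, loc, "-"), 2, scl, "/")

# Cutoff derived from a tolerance probability
# (assumption for this sketch: chi-square quantile with 1 df).
tolProb <- 0.99
cutoff <- sqrt(qchisq(tolProb, df = 1))

# Row and column indices of the cells exceeding the cutoff.
flagged <- which(abs(Z) > cutoff, arr.ind = TRUE)
```

Because the planted values lie far from the bulk of their columns, both appear in flagged, possibly together with a few regular cells that exceed the cutoff by chance.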

Usage

DDC(X, DDCpars = list())

Arguments

X

X is the input data, and must be an n by d matrix or a data frame.

DDCpars

A list of available options:

  • fracNA
    Only consider columns and rows in which the fraction of NAs (missing values) is below this value. Defaults to 0.5.

  • numDiscrete
    A column that takes on numDiscrete or fewer values will be considered discrete and not used in the analysis. Defaults to 3.

  • precScale
    Only consider columns whose scale is larger than precScale. Here scale is measured by the median absolute deviation. Defaults to 1e-12.

  • cleanNAfirst
    If "columns", first columns then rows are checked for NAs. If "rows", first rows then columns are checked for NAs. "automatic" checks columns first if d \geq 5n and rows first otherwise. Defaults to "automatic".

  • tolProb
    Tolerance probability, with default 0.99, which determines the cutoff values for flagging outliers in several steps of the algorithm.

  • corrlim
    When trying to estimate z_{ij} from other variables h, we will only use variables h with |\rho_{j,h}| \ge corrlim. Variables j without any correlated variables h satisfying this are considered standalone, and treated on their own. Defaults to 0.5.

  • combinRule
    The operation used to combine the estimates of z_{ij} coming from other variables h: can be "mean", "median", "wmean" (weighted mean) or "wmedian" (weighted median). Defaults to "wmean".

  • returnBigXimp
    If TRUE, the imputed data matrix Ximp in the output will include the rows and columns that were not part of the analysis (and can still contain NAs). Defaults to FALSE.

  • silent
    If TRUE, statements tracking the algorithm's progress will not be printed. Defaults to FALSE.

  • nLocScale
    When estimating location or scale from more than nLocScale data values, the computation is based on a random sample of size nLocScale to save time. When nLocScale = 0 all values are used. Defaults to 25000.

  • fastDDC
    Whether to use the fastDDC algorithm, which relies on approximations so that it can handle high-dimensional data. Defaults to TRUE for d > 750 and FALSE otherwise.

  • standType
    The location and scale estimators used for robust standardization. Should be one of "1stepM", "mcd" or "wrap". See estLocScale for more info. Only used when fastDDC = FALSE. Defaults to "1stepM".

  • corrType
    The correlation estimator used to find the neighboring variables. Must be one of "wrap" (wrapping correlation), "rank" (Spearman correlation) or "gkwls" (Gnanadesikan-Kettenring correlation followed by weighting). Only used when fastDDC = FALSE. Defaults to "gkwls".

  • transFun
    The transformation function used to compute the robust correlations when fastDDC = TRUE. Can be "wrap" or "rank". Defaults to "wrap".

  • nbngbrs
    When fastDDC = TRUE, each column is predicted from at most nbngbrs columns correlated to it. Defaults to 100.
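
As an illustration of how these options are passed, the sketch below builds a DDCpars list from a few of the entries described above and hands it to DDC. The simulated data and the names X, f, pars and res are hypothetical. The call is guarded so that the snippet also runs when cellWise is not installed.

```r
# Hypothetical toy data: d columns sharing one latent factor f,
# so that all columns are strongly correlated.
set.seed(2)
n <- 100; d <- 5
f <- rnorm(n)
X <- sapply(seq_len(d), function(j) f + 0.3 * rnorm(n))
X[3, 2] <- 10    # plant one deviating cell

# Options not listed here keep their defaults.
pars <- list(
  fracNA      = 0.5,      # drop rows/columns with too many NAs
  numDiscrete = 3,        # columns with this few distinct values are set aside
  tolProb     = 0.99,     # tolerance probability for the cutoffs
  corrlim     = 0.5,      # minimum absolute correlation for a neighbor
  combinRule  = "wmean",  # how estimates from neighbors are combined
  silent      = TRUE      # suppress progress messages
)

# Only run the analysis when the package is available.
if (requireNamespace("cellWise", quietly = TRUE)) {
  res <- cellWise::DDC(X, DDCpars = pars)
  str(res$stdResid)       # standardized residuals, as used with cellMap
}
```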

Value

A list with components:

Author(s)

Raymaekers J., Rousseeuw P.J., Van den Bossche W.

References

Rousseeuw, P.J., Van den Bossche, W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145.

Raymaekers, J., Rousseeuw, P.J. (2019). Fast robust correlation for high dimensional data. Technometrics, 63(2), 184-198.

See Also

checkDataSet, cellMap

Examples

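library(MASS)       # for mvrnorm
library(cellWise)   # for DDC and cellMap
set.seed(12345)
n <- 50; d <- 20
# Covariance matrix with unit variances and all correlations equal to 0.9
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)
# Contaminate randomly chosen cells with NA, 10 and -10 (50 draws each)
x[sample(1:(n * d), 50, FALSE)] <- NA
x[sample(1:(n * d), 50, FALSE)] <- 10
x[sample(1:(n * d), 50, FALSE)] <- -10
# Prepend a column holding the case numbers 1, ..., n
x <- cbind(1:n, x)
DDCx <- DDC(x)
# Draw a cell map of the standardized residuals
cellMap(DDCx$stdResid)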

# For more examples, we refer to the vignette:
## Not run: 
vignette("DDC_examples")

## End(Not run)

[Package cellWise version 2.5.3 Index]