ICS_outlier {ICSOutlier}    R Documentation

Outlier Detection Using ICS

Description

Detects outliers in a multivariate framework using ICS. The function runs ICS() and automatically decides how many invariant components to use in the search for outliers and which observations are flagged as outliers on these components. Currently the function only searches for outliers on the first components.

Usage

ICS_outlier(
  X,
  S1 = ICS_cov,
  S2 = ICS_cov4,
  S1_args = list(),
  S2_args = list(),
  ICS_algorithm = c("whiten", "standard", "QR"),
  method = "norm_test",
  test = "agostino.test",
  n_eig = 10000,
  level_test = 0.05,
  adjust = TRUE,
  level_dist = 0.025,
  n_dist = 10000,
  type = "smallprop",
  n_cores = NULL,
  iseed = NULL,
  pkg = "ICSOutlier",
  q_type = 7,
  ...
)

Arguments

X

a numeric matrix or data frame containing the data to be transformed.

S1

an object of class "ICS_scatter", or a function returning the location vector and scatter matrix as location and scatter components.

S2

an object of class "ICS_scatter", or a function returning the location vector and scatter matrix as location and scatter components.

S1_args

a list containing additional arguments for S1.

S2_args

a list containing additional arguments for S2.

ICS_algorithm

a character string specifying with which algorithm the invariant coordinate system is computed. Possible values are "whiten", "standard" or "QR".

method

name of the method used to select the ICS components used to compute the ICS distances. Options are "norm_test" and "simulation". Depending on the method, either comp_norm_test or comp_simu_test is used.

test

name of the marginal normality test to use if method = "norm_test". Possibilities are "jarque.test", "anscombe.test", "bonett.test", "agostino.test", "shapiro.test". Default is "agostino.test".

n_eig

number of simulations performed to derive the cut-off values for selecting the ICS components. Only used if method = "simulation". See comp_simu_test for details.

level_test

initial level used by the comp_norm_test or comp_simu_test functions for selecting the invariant coordinates.

adjust

logical. For selecting the invariant coordinates, the level of the test can be adjusted for each component to deal with multiple testing. See comp_norm_test and comp_simu_test for details. Default is TRUE.

level_dist

level for the dist_simu_test function. The (1-level)th quantile is used to determine the cut-off value for the ICS distances.

n_dist

number of simulations performed to derive the cut-off value for the ICS distances. See dist_simu_test for details.

type

currently the only option is "smallprop" which means that only the first ICS components can be selected. See comp_norm_test or comp_simu_test for details.

n_cores

number of cores to be used in dist_simu_test and comp_simu_test. If NULL or 1, no parallel computing is used. Otherwise makeCluster with type = "PSOCK" is used.

iseed

if parallel computation is used, the seed passed on to clusterSetRNGStream. Default is NULL, which means no fixed seed is used.

pkg

When using parallel computing, a character vector listing all the packages which need to be loaded on the different cores via require. Must be at least "ICSOutlier" and must contain the packages needed to compute the scatter matrices.

q_type

specifies the quantile algorithm used in quantile.

...

passed on to other methods.
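
As an illustration of how n_cores, iseed and pkg fit together, the following is a minimal sketch using only the parallel package. It is not the internal code of ICS_outlier: the package name "stats" is a stand-in for "ICSOutlier" so that the sketch runs anywhere, and the simulated quantity (a single normal draw per task) is hypothetical.

```r
library(parallel)

n_cores <- 2        # illustrative value
iseed   <- 123      # seed passed on to clusterSetRNGStream
pkg     <- "stats"  # stand-in for c("ICSOutlier", ...), an assumption

# PSOCK cluster, as described for n_cores
cl <- makeCluster(n_cores, type = "PSOCK")

# Load the required packages on every worker via require()
loaded <- clusterCall(cl, function(pkgs) {
  vapply(pkgs, require, logical(1), character.only = TRUE)
}, pkg)

# With a fixed seed the parallel random streams are reproducible
clusterSetRNGStream(cl, iseed)
draws1 <- parSapply(cl, 1:4, function(i) rnorm(1))
clusterSetRNGStream(cl, iseed)
draws2 <- parSapply(cl, 1:4, function(i) rnorm(1))

stopCluster(cl)

identical(draws1, draws2)
```

With iseed = NULL the streams would not be reset, so repeated runs would generally give different simulated cut-offs.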

Details

The ICS method has attractive properties for outlier detection in the case of a small proportion of outliers. As for PCA, three steps have to be performed: (i) select the components most useful for the detection, (ii) compute distances as outlierness measures for all observations and finally (iii) label outliers using some cut-off value.

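This function performs these three steps automatically: the components are selected with comp_norm_test or comp_simu_test (depending on method), the ICS distances are computed on the selected components, and observations whose distance exceeds the cut-off derived by dist_simu_test are labelled as outliers.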

As a rule of thumb, the percentage of contamination should be limited to 10% in the case of a mixture of Gaussian distributions and when using the default combination of locations and scatters for ICS.
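
The three steps described above can be sketched in base R as follows. This is a simplified illustration, not the package implementation. The matrix Z is a hypothetical stand-in for centered invariant components. Testing component j at level level_test / j, using squared Euclidean norms as distances, and averaging the simulated quantiles are all assumptions made for this sketch. The simulation draws normal scores directly instead of rerunning ICS on simulated data.

```r
set.seed(1)
n <- 200
p <- 3

# Hypothetical centered components: normal scores with four
# observations shifted on the first component
Z <- matrix(rnorm(n * p), n, p)
Z[1:4, 1] <- Z[1:4, 1] + 8
Z <- scale(Z, center = TRUE, scale = FALSE)

# (i) Select the first components sequentially with a marginal
# normality test; stop at the first component that is not rejected.
# The level_test / j adjustment is an assumption of this sketch.
level_test <- 0.05
k <- 0
for (j in seq_len(p)) {
  if (shapiro.test(Z[, j])$p.value > level_test / j) break
  k <- j
}

# (ii) Outlierness measure: squared Euclidean norm of the k selected
# components
d <- rowSums(Z[, seq_len(k), drop = FALSE]^2)

# (iii) Cut-off: mean over n_dist simulations of the (1 - level_dist)
# quantile of the same measure computed on normal data
level_dist <- 0.025
n_dist <- 200
q_type <- 7
sim_q <- replicate(n_dist, {
  Zs <- scale(matrix(rnorm(n * k), n, k), center = TRUE, scale = FALSE)
  quantile(rowSums(Zs^2), probs = 1 - level_dist, type = q_type)
})
cutoff <- mean(sim_q)

outliers <- which(d > cutoff)
outliers
```

Because the cut-off is an upper quantile under the normal model, a few regular observations are expected to exceed it as well; lowering level_dist makes the labelling more conservative.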

Value

An object of S3-class 'ICS_Out' which contains:

Author(s)

Aurore Archimbaud and Klaus Nordhausen

References

Archimbaud, A., Nordhausen, K. and Ruiz-Gazen, A. (2018), ICS for multivariate outlier detection with application to quality control. Computational Statistics & Data Analysis, 128:184-199. ISSN 0167-9473. doi:10.1016/j.csda.2018.06.011.

See Also

ICS(), comp_norm_test(), comp_simu_test(), dist_simu_test() and print(), plot(), summary() methods

Examples

# ReliabilityData example: the observations 414 and 512 are suspected to be outliers  
library(REPPlab)
data(ReliabilityData)
# For demo purposes n_dist is kept small here, but as extreme quantiles
# are of interest n_dist should be much larger. The number of cores used
# should also be larger if available
icsOutlierDA <- ICS_outlier(ReliabilityData, S1 = ICS_tM, S2 = ICS_cov,
                            level_dist = 0.01, n_dist = 50, n_cores = 1)
icsOutlierDA
summary(icsOutlierDA)
plot(icsOutlierDA)

## Not run: 
  # Using several cores and a scatter function from a different package
  # The parallel package is used to detect the number of cores automatically
  library(parallel)
  # ICS with MCD estimates and the usual estimates
  # ICS_mcd_rwt from the ICSClust package is used as the first scatter function
  data(HTP)
  library(ICSClust)
  # For demo purposes n_eig and n_dist are kept small,
  # should select the first seven components
  icsOutlier <- ICS_outlier(HTP, S1 = ICS_mcd_rwt, S2 = ICS_cov,
                            S1_args = list(location = TRUE, alpha = 0.75),
                            n_eig = 50, level_test = 0.05, adjust = TRUE,
                            level_dist = 0.025, n_dist = 50,
                            n_cores =  detectCores()-1, iseed = 123,
                            pkg = c("ICSOutlier", "ICSClust"))
  icsOutlier

## End(Not run)

# Example of no direction and hence also no outlier
library(mvtnorm)  # provides rmvnorm()
set.seed(123)
X = rmvnorm(500, rep(0, 2), diag(rep(0.1,2)))
icsOutlierJB <- ICS_outlier(X, test = "jarque.test", level_dist = 0.01,
                            level_test = 0.01, n_dist = 100, n_cores = 1)
summary(icsOutlierJB)
plot(icsOutlierJB)
rm(.Random.seed)

# Example of no outlier
set.seed(123)
X = matrix(rweibull(1000, 4, 4), 500, 2)
X = apply(X,2, function(x){ifelse(x<5 & x>2, x, runif(sum(!(x<5 & x>2)), 5, 5.5))})
icsOutlierAG <- ICS_outlier(X, test = "anscombe.test", level_dist = 0.01,
                            level_test = 0.05, n_dist = 100, n_cores = 1)
summary(icsOutlierAG)
plot(icsOutlierAG)
rm(.Random.seed)

[Package ICSOutlier version 0.4-0 Index]