cerioli2010.fsrmcd.test {CerioliOutlierDetection}    R Documentation

Finite-Sample Reweighted MCD Outlier Detection Test of Cerioli (2010)

Description

Given a set of observations, this function tests whether there are outliers in the data set and identifies the outlying points. Outlier testing and identification are based on Mahalanobis distances computed from the minimum covariance determinant (MCD) dispersion estimate. The finite-sample reweighted MCD method of Cerioli (2010) is used to test for unusually large distances, which indicate possible outliers.
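The general idea of flagging observations with unusually large Mahalanobis distances can be sketched in base R. This is an illustration only: it swaps in the classical sample mean and covariance and an asymptotic chi-squared cutoff, not the MCD estimates or the finite-sample reweighted procedure of Cerioli (2010). The data, the shift of 6, and the 0.975 quantile are assumptions chosen for the example.

```r
# Illustration only: classical estimates with an asymptotic chi-squared
# cutoff, NOT the finite-sample reweighted MCD test.
set.seed(1)
v <- 3      # dimension
n <- 100    # number of observations
x <- matrix(rnorm(n * v), nrow = n, ncol = v)
x[1:5, ] <- x[1:5, ] + 6   # shift five rows away from the bulk of the data

# squared Mahalanobis distances from the sample mean and covariance
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# flag squared distances above the 0.975 chi-squared quantile (v df)
cutoff <- qchisq(0.975, df = v)
flagged <- which(d2 > cutoff)
flagged
```

Classical estimates can themselves be distorted by the outliers they are meant to reveal, which is the motivation for using a robust estimate such as the MCD in place of colMeans and cov.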

Usage

cerioli2010.fsrmcd.test(datamat, 
  mcd.alpha = max.bdp.mcd.alpha(n,v), 
  signif.alpha = 0.05, nsamp = 500, 
  nmini = 300, trace = FALSE, 
  delta = 0.025, hrdf.method=c("GM14","HR05"))

Arguments

datamat

(Data Frame or Matrix) Data set to test for outliers (rows = observations, columns = variables). datamat cannot have missing values; please deal with them prior to calling this function. datamat will be converted to a matrix.

mcd.alpha

(Numeric) Value to control the fraction of observations used to compute the covariance matrices in the MCD calculation. The default value corresponds to the maximum breakdown point case of the MCD; valid values are between 0.5 and 1. See the covMcd documentation in the robustbase library for further details.

signif.alpha

(Numeric) Desired nominal size α of the individual outlier test (default value is 0.05), i.e., the significance level at which individual observations are tested for outlyingness. (This is the α parameter in Cerioli (2010).) A vector of levels may be supplied, in which case observations are tested at each level separately. To test the intersection hypothesis of no outliers in the data, specify

alpha = 1 - (1 - gamma)^(1/n),

where γ is the nominal size of the intersection test and n is the number of observations.

nsamp

(Integer) Number of subsamples to use in computing the MCD. See the covMcd documentation in the robustbase library.

nmini

(Integer) See the covMcd documentation in the robustbase library.

trace

(Logical) See the covMcd documentation in the robustbase library.

delta

(Numeric) False-positive rate to use in the reweighting step (Step 2). Defaults to 0.025 as used in Cerioli (2010). When the ratio n/ν of sample size to dimension is very small, using a smaller delta can improve the accuracy of the method.

hrdf.method

(String) Method to use for computing degrees of freedom and cutoff values for the non-MCD subset. The original method of Hardin and Rocke (2005) and the expanded method of Green and Martin (2014) are available as the options “HR05” and “GM14”, respectively. “GM14” is the default, as it is more accurate across a wider range of mcd.alpha values.
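As a worked example of the signif.alpha formula for the intersection hypothesis given above, the following base R sketch converts a desired intersection size gamma into the per-observation level. The values of gamma and n are assumptions chosen for illustration; the check at the end treats the n individual tests as independent.

```r
# gamma: desired nominal size of the intersection test (no outliers at all)
# n: number of observations (rows of datamat)
gamma <- 0.01
n <- 200

# per-observation level to pass as signif.alpha
signif.alpha <- 1 - (1 - gamma)^(1 / n)
signif.alpha

# sanity check: n independent tests at this level have overall size gamma
overall <- 1 - (1 - signif.alpha)^n
all.equal(overall, gamma)

# the formula is vectorised, so several intersection sizes can be
# converted at once and passed together as signif.alpha
gammas <- c(0.05, 0.01, 0.001)
sa <- 1 - (1 - gammas)^(1 / n)
sa
```

The resulting level is slightly larger than the Bonferroni value gamma/n, so it is marginally less conservative.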

Value

mu.hat

Location estimate from the MCD calculation

sigma.hat

Dispersion estimate from the MCD calculation

mahdist

Mahalanobis distances calculated using the MCD estimate

DD

Hardin-Rocke or Green-Martin critical values for testing MCD distances. Used to produce weights for reweighted MCD. See Equation (16) in Cerioli (2010).

weights

Weights used in the reweighted MCD. See Equation (16) in Cerioli (2010).

mu.hat.rw

Location estimate from the reweighted MCD calculation

sigma.hat.rw

Dispersion estimate from the reweighted MCD calculation

mahdist.rw

A matrix of dimension nrow(datamat) by length(signif.alpha) of Mahalanobis distances computed using the finite-sample reweighted MCD methodology in Cerioli (2010). Although the distances do not depend on signif.alpha, there is one column per entry in signif.alpha for user convenience.

critvalfcn

Function to compute critical values for Mahalanobis distances based on the reweighted MCD; see Equations (18) and (19) in Cerioli (2010). The function takes a significance level as its only argument and returns a critical value for each of the original observations. There will only be two unique values: one for points included in the reweighted MCD (weights == 1) and one for points excluded from it (weights == 0).

signif.alpha

Significance levels used in testing.

mcd.alpha

Fraction of the observations used to compute the MCD estimate

outliers

A matrix of dimension nrow(datamat) by length(signif.alpha) indicating whether each row of datamat is an outlier. The i-th column corresponds to the result of testing observations for outlyingness at significance level signif.alpha[i].
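The outliers matrix described above can be summarised per significance level with base R. The sketch below uses a small hypothetical stand-in matrix with the documented shape (observations in rows, one column per entry of signif.alpha) rather than output from the function itself; the column names are added here for illustration and are an assumption, not something the function is documented to provide.

```r
# Hypothetical stand-in for results$outliers: 6 observations tested at
# 2 significance levels (TRUE = flagged as an outlier).
outliers <- matrix(
  c(FALSE, TRUE, FALSE, FALSE, TRUE,  FALSE,   # column 1
    FALSE, TRUE, FALSE, FALSE, FALSE, FALSE),  # column 2
  nrow = 6, ncol = 2
)
signif.alpha <- c(0.05, 0.01)
colnames(outliers) <- paste0("alpha=", signif.alpha)  # illustrative labels

# number of flagged observations at each significance level
n.flagged <- colSums(outliers)
n.flagged

# row indices of the flagged observations, one list element per level
flagged.rows <- lapply(seq_len(ncol(outliers)),
                       function(j) which(outliers[, j]))
names(flagged.rows) <- colnames(outliers)
flagged.rows
```

Using lapply rather than apply keeps the result a list even when every level flags the same number of rows.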

Author(s)

Written and maintained by Christopher G. Green <christopher.g.green@gmail.com>

References

Andrea Cerioli. Multivariate outlier detection with high-breakdown estimators. Journal of the American Statistical Association, 105(489):147-156, 2010. doi: 10.1198/jasa.2009.tm09147

Andrea Cerioli, Marco Riani, and Anthony C. Atkinson. Controlling the size of multivariate outlier tests with the MCD estimator of scatter. Statistics and Computing, 19:341-353, 2009. doi: 10.1007/s11222-008-9096-5

See Also

cerioli2010.irmcd.test

Examples


require(mvtnorm, quietly=TRUE)

############################################
# dimension v, number of observations n
v <- 5
n <- 200
# rmvnorm returns an n x v matrix (one row per observation)
simdata <- rmvnorm(n, mean=rep(0,v), 
    sigma = diag(rep(1,v)))
#
# detect outliers with nominal sizes 
# c(0.05,0.01,0.001)
#
sa <- 1. - ((1. - c(0.05,0.01,0.001))^(1./n))
results <- cerioli2010.fsrmcd.test( simdata, 
    signif.alpha=sa )
# count number of outliers detected for each 
# significance level
colSums( results$outliers )


#############################################
# add some contamination to illustrate how to 
# detect outliers using the fsrmcd test
# 10/200 = 5% contamination
# replace 10 randomly chosen rows with draws from a shifted distribution
simdata[ sample(n,10), ] <- rmvnorm( 10, 
  mean=rep(2,v), sigma = diag(rep(1,v)) )
results <- cerioli2010.fsrmcd.test( simdata, 
  signif.alpha=sa )
colMeans( results$outliers )


## Not run: 
#############################################
# example of how to ensure the size of the intersection test is correct

  n.sim <- 5000
  # n x v x n.sim array: one simulated data set per slice
  simdata <- replicate( n.sim, 
    rmvnorm(n, mean=rep(0,v), sigma=diag(rep(1,v)))
  )
  # in practice we'd do this using one of the parallel processing
  # methods out there
  sa <- 1. - ((1. - 0.01)^(1./n))
  results <- apply( simdata, 3, function(dm) {
    z <- cerioli2010.fsrmcd.test( dm, 
      signif.alpha=sa )
    # true if outliers were detected in the data, false otherwise
    any(z$outliers[,1,drop=TRUE])
  })
  # proportion of simulated samples in which at least one outlier was
  # detected; since the samples contain no outliers, this should be close
  # to the nominal size of the intersection test (0.01).
  mean(results)


## End(Not run)

[Package CerioliOutlierDetection version 1.1.9 Index]