R: Generalized S-Estimator in the presence of missing data

GSE {GSE}

R Documentation

Generalized S-Estimator in the presence of missing data

Description

Computes the Generalized S-Estimate (GSE) – a robust estimate of location and scatter for data with contamination and missingness.

Usage

GSE(x, tol=1e-4, maxiter=150, method=c("bisquare","rocke"), 
    init=c("emve","qc","huber","imputed","emve_c"), mu0, S0, ...)

Arguments

`x`	a matrix or data frame. May contain missing values, but cannot contain columns with completely missing entries.
`tol`	tolerance for the convergence criterion. Default is 1e-4.
`maxiter`	maximum number of iterations for the GSE algorithm. Default is 150.
`method`	which loss function to use: either "bisquare" or "rocke".
`init`	type of initial estimator. Currently this can either be "emve" (EMVE with uniform sampling, see Danilov et al., 2012), "qc" (QC, see Danilov et al., 2012), "huber" (Huber Pairwise, see Danilov et al., 2012), "imputed" (Imputed S-estimator, see the rejoinder in Agostinelli et al., 2015), or "emve_c" (EMVE_C with cluster sampling, see Leung and Zamar, 2016). Default is "emve". If `mu0` and `S0` are provided, this argument is ignored.
`mu0`	optional vector of initial location estimate
`S0`	optional matrix of initial scatter estimate
`...`	optional arguments for computing the initial estimates (see `emve`, `HuberPairwise`).
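As a rough illustration of supplying `mu0` and `S0` directly (in which case `init` is ignored), the sketch below builds a simple starting point from coordinatewise medians and squared MADs. This particular choice of starting values is an assumption made for illustration, not a recommendation from the package authors; the call to `GSE` is guarded so the snippet also runs when the package is not installed.

```r
set.seed(1)
n <- 60
p <- 4
x <- matrix(rnorm(n * p), n, p)

# Put exactly one NA in each of the first 12 rows, so that no row and
# no column is completely missing.
x[cbind(1:12, rep(1:4, 3))] <- NA

# Illustrative (assumed) starting values: coordinatewise median and a
# diagonal scatter of squared MADs, which is positive definite.
mu0 <- apply(x, 2, median, na.rm = TRUE)
S0 <- diag(apply(x, 2, mad, na.rm = TRUE)^2)

if (requireNamespace("GSE", quietly = TRUE)) {
  # `init` is ignored because both mu0 and S0 are supplied.
  fit <- GSE::GSE(x, mu0 = mu0, S0 = S0)
  print(GSE::getLocation(fit))
}
```

A diagonal `S0` ignores correlations between variables, so the built-in initial estimators are likely a better choice in practice.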

Details

This function computes GSE (Danilov et al., 2012) and GRE (Leung and Zamar, 2016). The estimator requires a robust, positive definite initial estimator. The initial estimator is needed to "re-scale" the squared partial Mahalanobis distances across the different missingness patterns, for which a single scale parameter is not enough. This function currently allows two main initial estimators: EMVE (the default; see emve) and Huberized Pairwise (see HuberPairwise). GSE using Huberized Pairwise with the sign psi function is referred to as QGSE in Danilov et al. (2012). Numerical results have shown that GSE with EMVE as the initial estimator performs better (in both efficiency and robustness), but its computing time can be longer.
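To make the squared partial Mahalanobis distance concrete, here is a minimal base R sketch that computes it for a single case from its observed entries only. The helper name `partial_mahalanobis` is hypothetical and is not part of the package; the package's own implementation may differ.

```r
# Squared partial Mahalanobis distance of one case `xi`, using only the
# observed (non-NA) entries and the matching sub-vector of `mu` and
# sub-matrix of `S`. Hypothetical helper, for illustration only.
partial_mahalanobis <- function(xi, mu, S) {
  obs <- !is.na(xi)
  d <- xi[obs] - mu[obs]
  S_oo <- S[obs, obs, drop = FALSE]
  list(pmd = drop(crossprod(d, solve(S_oo, d))),  # d' S_oo^{-1} d
       pu = sum(obs))                             # number of observed entries
}

mu <- c(0, 0, 0)
S <- diag(3)

full <- partial_mahalanobis(c(1, 2, 2), mu, S)   # all 3 entries observed
part <- partial_mahalanobis(c(1, NA, 2), mu, S)  # only 2 entries observed
```

The two distances are computed in different dimensions (`pu` is 3 and 2 here), so they are not directly comparable. This is why a single scale parameter does not suffice across missingness patterns.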

Value

An S4 object of class GSE-class, which is a subclass of the virtual class CovRobMissSc-class. The output S4 object contains the following slots:

`mu`	Estimated location. Can be accessed via `getLocation`.
`S`	Estimated scatter matrix. Can be accessed via `getScatter`.
`sc`	Generalized S-scale (GS-scale). Can be accessed via `getScale`.
`pmd`	Squared partial Mahalanobis distances. Can be accessed via `getDist`.
`pmd.adj`	Adjusted squared partial Mahalanobis distances. Can be accessed via `getDistAdj`.
`pu`	Dimension of the observed entries for each case. Can be accessed via `getDim`.
`mu0`	Estimated initial location.
`S0`	Estimated initial scatter matrix.
`ximp`	Input data matrix with missing values imputed using best linear predictor. Not meant to be accessed.
`weights`	Weights used in the estimation of the location. Not meant to be accessed.
`weightsp`	First derivative of the weights used in the estimation of the location. Not meant to be accessed.
`iter`	Number of iterations until convergence. Not meant to be accessed.
`eps`	Relative change of the GS-scale at convergence. Not meant to be accessed.
`call`	Object of class `"language"`. Not meant to be accessed.
`x`	Input data matrix. Not meant to be accessed.
`p`	Column dimension of input data matrix. Not meant to be accessed.
`estimator`	Character string of the name of the estimator used. Not meant to be accessed.
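The `ximp` slot is described as imputed using the best linear predictor. As a sketch of what that means for a single case, the base R code below fills the missing entries with their conditional mean given the observed entries, for a given location `mu` and scatter `S`. The helper name `impute_blp` is hypothetical, and the package's internal computation may differ in detail.

```r
# Best linear predictor of the missing entries of `xi` given the observed
# ones: mu_m + S_mo S_oo^{-1} (x_o - mu_o). Hypothetical helper; assumes
# `xi` has at least one observed entry.
impute_blp <- function(xi, mu, S) {
  mis <- is.na(xi)
  if (!any(mis)) return(xi)
  obs <- !mis
  S_mo <- S[mis, obs, drop = FALSE]
  S_oo <- S[obs, obs, drop = FALSE]
  xi[mis] <- mu[mis] + drop(S_mo %*% solve(S_oo, xi[obs] - mu[obs]))
  xi
}

mu <- c(0, 0)
S <- matrix(c(1, 0.5, 0.5, 1), 2, 2)

# Second coordinate is missing; it is predicted as 0 + 0.5 * (2 - 0) = 1.
imputed <- impute_blp(c(2, NA), mu, S)
```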

Author(s)

Andy Leung andy.leung@stat.ubc.ca, Ruben H. Zamar, Mike Danilov, Victor J. Yohai

References

Agostinelli, C., Leung, A., Yohai, V.J., and Zamar, R.H. (2015). Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. TEST.

Danilov, M., Yohai, V.J., and Zamar, R.H. (2012). Robust Estimation of Multivariate Location and Scatter in the Presence of Missing Data. Journal of the American Statistical Association 107, 1178–1186.

Leung, A. and Zamar, R.H. (2016). Multivariate Location and Scatter Matrix Estimation Under Cellwise and Casewise Contamination. Submitted.

Examples

library(GSE)
set.seed(12)

## generate 10-dimensional data with 10% casewise contamination
n <- 100
p <- 10
A <- matrix(0.9, p, p)
diag(A) <- 1
x <- generate.casecontam(n, p, cond=100, contam.size=10, contam.prop=0.1, A=A)$x

## introduce 5% missingness
pmiss <- 0.05
nmiss <- matrix(rbinom(n*p,1,pmiss), n,p)
x[ which( nmiss == 1 ) ] <- NA

## Using EMVE as initial
res.emve <- GSE(x)
slrt( getScatter(res.emve), A) ## LRT distances to the true covariance

## Using QC as initial
res.qc <- GSE(x, init="qc")
slrt( getScatter(res.qc), A) ## in general performs worse than with EMVE as the initial estimator

[Package GSE version 4.2-1 Index]