generateData {cellWise}    R Documentation

Generates artificial datasets with outliers

Description

This function generates multivariate normal datasets with several possible types of outliers. It is used in several simulation studies. For a detailed description, see the referenced papers.

Usage

generateData(n, d, mu, Sigma, perout, gamma,
             outlierType = "casewise", seed = NULL)

Arguments

n

The number of observations.

d

The dimension of the data.

mu

The center of the clean data.

Sigma

The covariance matrix of the clean data. It can be generated with generateCorMat.

outlierType

The type of contamination to be generated. Should be one of:

  • "casewise": Generates point contamination in the direction of the last eigenvector of Sigma.

  • "cellwisePlain": Generates cellwise contamination by randomly replacing a number of cells by gamma.

  • "cellwiseStructured": Generates cellwise contamination by first randomly sampling the cells to contaminate. Then, within each row, the sampled cells are replaced by a multiple of the eigenvector with the smallest eigenvalue of Sigma restricted to the dimensions of the contaminated cells.

  • "both": combines "casewise" and "cellwiseStructured".
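The "cellwisePlain" mechanism can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper name sketch_cellwise_plain is hypothetical, the contaminated cells are drawn uniformly over the whole matrix, and the cell count is obtained by rounding. generateData may sample and round differently.

```r
# Hypothetical helper, not part of cellWise: a minimal sketch of
# "cellwisePlain" contamination using base R only.
sketch_cellwise_plain <- function(n, d, mu, Sigma, perout, gamma, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Clean data: rows drawn from N(mu, Sigma) via the Cholesky factor of Sigma.
  Z <- matrix(rnorm(n * d), nrow = n, ncol = d)
  X <- sweep(Z %*% chol(Sigma), 2, mu, "+")
  # Assumption: sample a fraction perout of all n * d cells without replacement.
  ncont <- round(perout * n * d)
  indcells <- sample(n * d, ncont)
  # Replace the sampled cells by the value gamma.
  X[indcells] <- gamma
  list(X = X, indcells = sort(indcells))
}

out <- sketch_cellwise_plain(n = 100, d = 5, mu = rep(0, 5), Sigma = diag(5),
                             perout = 0.1, gamma = 10, seed = 1)
length(out$indcells)
```

The indices in indcells here are linear (column-major) positions in the matrix, so out$X[out$indcells] returns the contaminated values directly.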

perout

The fraction of generated outliers, given as a number between 0 and 1. For outlierType = "casewise" this is the fraction of contaminated rows. For outlierType = "cellwisePlain" or outlierType = "cellwiseStructured", a fraction perout of the cells is replaced by contaminated cells. For outlierType = "both", a fraction 0.5*perout of the rows is first turned into rowwise outliers, after which a fraction 0.5*perout of outlying cells is generated in the remaining data.
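As a rough worked example of how perout translates into counts, the arithmetic below uses n = 100, d = 5 and perout = 0.1. The variable names are illustrative, and two choices are assumptions rather than documented behaviour: the use of round(), and reading "the remaining data" as the cells of the rows that are not rowwise outliers.

```r
n <- 100
d <- 5
perout <- 0.1

# "casewise": a fraction perout of the rows
n_case_rows <- round(perout * n)

# "cellwisePlain" / "cellwiseStructured": a fraction perout of all cells
n_cells <- round(perout * n * d)

# "both": half of perout as rowwise outliers ...
n_both_rows <- round(0.5 * perout * n)
# ... then half of perout as outlying cells in the remaining rows (assumption)
n_both_cells <- round(0.5 * perout * (n - n_both_rows) * d)

c(casewise_rows = n_case_rows, cellwise_cells = n_cells,
  both_rows = n_both_rows, both_cells = n_both_cells)
```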

gamma

How far outliers are from the center of the distribution.
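To give a feel for the role of gamma under "casewise" contamination, the sketch below places a point at distance gamma from mu along the last eigenvector of Sigma and reports its Mahalanobis distance. The placement mu + gamma * v is an assumption made for illustration. The exact scaling used by generateData is not stated here and may differ.

```r
# Illustrative 2-dimensional covariance matrix with strong correlation
Sigma <- matrix(c(1, 0.9, 0.9, 1), nrow = 2)
mu <- c(0, 0)

# eigen() returns eigenvalues in decreasing order, so the last column of
# $vectors is the unit eigenvector belonging to the smallest eigenvalue.
eig <- eigen(Sigma, symmetric = TRUE)
v_last <- eig$vectors[, ncol(Sigma)]

# Assumed placement of a casewise outlier: gamma units along v_last
outlier_at <- function(gamma) mu + gamma * v_last

gammas <- c(1, 5, 10)
# mahalanobis() returns squared distances, hence the square root
md <- sapply(gammas, function(g) sqrt(mahalanobis(outlier_at(g), mu, Sigma)))
md
```

Under this placement the Mahalanobis distance equals gamma divided by the square root of the smallest eigenvalue, so it grows linearly in gamma. Moving along the direction of least variance makes a point outlying in the Mahalanobis sense even for moderate Euclidean distances.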

seed

Seed used to generate the data.

Value

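A list with components:

  • X: The generated data matrix of size n by d.

  • indcells: A vector with the indices of the contaminated cells.

  • indrows: A vector with the indices of the rowwise outliers.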

Author(s)

J. Raymaekers and P.J. Rousseeuw

References

Agostinelli, C., Leung, A., Yohai, V.J., Zamar, R.H. (2015). Robust Estimation of Multivariate Location and Scatter in the Presence of Cellwise and Casewise Contamination. Test, 24, 441-461.

Rousseeuw, P.J., Van den Bossche, W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145.

Raymaekers, J., Rousseeuw, P.J. (2020). Handling cellwise outliers by sparse regression and robust covariance. arXiv: 1912.12446.

See Also

generateCorMat

Examples

n      <- 100
d      <- 5
mu     <- rep(0, d)
Sigma  <- diag(d)
perout <- 0.1
gamma  <- 10
data   <- generateData(n, d, mu, Sigma, perout, gamma,
                       outlierType = "cellwisePlain", seed = 1)
pairs(data$X)
data$indcells

[Package cellWise version 2.5.3 Index]