R: Generate sparse data with outliers

dataGen {rospca}

R Documentation

Generate sparse data with outliers

Description

Generate sparse data with outliers using the simulation scheme detailed in Hubert et al. (2016).

Usage

dataGen(m = 100, n = 100, p = 10, a = c(0.9,0.5,0), bLength = 4, SD = c(10,5,2), 
        eps = 0, seed = TRUE)

Arguments

`m`	Number of datasets to generate, default is 100.
`n`	Number of observations, default is 100.
`p`	Number of dimensions, default is 10.
`a`	Numeric vector containing the within-group correlations for each block. The number of useful blocks is thus `k = length(a) - 1`, which should be at least 2. By default, the correlations are 0.9, 0.5 and 0, respectively.
`bLength`	Length of the blocks of useful variables, default is 4.
`SD`	Numeric vector containing the standard deviations of the blocks of variables, default is `c(10,5,2)`. Note that `SD` and `a` should have the same length.
`eps`	Proportion of contamination, should be between 0 and 0.5. Default is 0 (no contamination).
`seed`	Logical indicating if a seed is used when generating the datasets, default is `TRUE`.

Details

Firstly, we generate a correlation matrix R that has sparse eigenvectors. We design the correlation matrix to have length(a) = k+1 groups of variables, with no correlation between variables from different groups. The first k groups consist of bLength variables each. The correlation between distinct variables within group i is equal to a[i], for i = 1, ..., k. The (k+1)th group contains the remaining p - k \times bLength variables, which we specify to have correlation a[k+1].
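The block structure described in this first step can be sketched in base R as follows. This is a minimal illustration, not the package's internal code; the helper name `blockCor` is hypothetical and only the arguments `p`, `a` and `bLength` come from the Usage section.

```r
# Hypothetical helper (not part of rospca): block-structured correlation matrix
blockCor <- function(p = 10, a = c(0.9, 0.5, 0), bLength = 4) {
  k <- length(a) - 1                       # number of useful blocks
  stopifnot(k >= 2, p >= k * bLength)
  # Group label per variable: 1..k for the useful blocks, k+1 for the rest
  grp <- c(rep(seq_len(k), each = bLength), rep(k + 1, p - k * bLength))
  # Same group: that group's correlation; different groups: 0
  R <- outer(grp, grp, function(g, h) ifelse(g == h, a[g], 0))
  diag(R) <- 1                             # unit diagonal
  R
}

R <- blockCor()
```

With the default arguments, `R[1, 2]` is 0.9, `R[5, 6]` is 0.5, and every entry linking two different groups is 0.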
Secondly, the correlation matrix R is transformed into the covariance matrix \Sigma= V^{0.5} \cdot R \cdot V^{0.5}, where V=diag(SD^2).
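As a rough sketch of this transformation in base R: since `SD` has one entry per block while V is p by p, the code below assumes that every variable inherits the standard deviation of its block. That expansion is an assumption made for illustration; the source does not spell it out.

```r
p <- 10; bLength <- 4
a <- c(0.9, 0.5, 0); SD <- c(10, 5, 2)
k <- length(a) - 1

# Group label per variable, then the block correlation matrix R
grp <- c(rep(seq_len(k), each = bLength), rep(k + 1, p - k * bLength))
R <- outer(grp, grp, function(g, h) ifelse(g == h, a[g], 0))
diag(R) <- 1

# Assumption: each variable gets the SD of the block it belongs to
sdVar <- SD[grp]

# V = diag(sdVar^2), so V^{0.5} = diag(sdVar)
Sigma <- diag(sdVar) %*% R %*% diag(sdVar)
```

Applying `cov2cor(Sigma)` recovers `R`, which is a quick way to check the construction.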
Thirdly, the n observations are generated from a p-variate normal distribution with zero mean vector and covariance matrix \Sigma. Standard normally distributed noise terms are then added to each of the p variables to make the sparse structure of the data harder to detect.
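The sampling step might look like the following in base R. It is a sketch under stated assumptions: the Cholesky-based sampler is swapped in for illustration and is not claimed to be what `dataGen` uses internally, and each variable is assumed to take the standard deviation of its block.

```r
set.seed(123)                              # arbitrary seed, for reproducibility only
n <- 100; p <- 10; bLength <- 4
a <- c(0.9, 0.5, 0); SD <- c(10, 5, 2)
k <- length(a) - 1

# Block correlation matrix and covariance matrix (per-variable SD is an assumption)
grp <- c(rep(seq_len(k), each = bLength), rep(k + 1, p - k * bLength))
R <- outer(grp, grp, function(g, h) ifelse(g == h, a[g], 0))
diag(R) <- 1
Sigma <- diag(SD[grp]) %*% R %*% diag(SD[grp])

# Draw from N(0, Sigma): if Z has i.i.d. N(0, 1) entries and t(U) %*% U = Sigma,
# then the rows of Z %*% U have covariance Sigma
U <- chol(Sigma)
Z <- matrix(rnorm(n * p), n, p)
X <- Z %*% U

# Add standard normal noise to every variable
X <- X + matrix(rnorm(n * p), n, p)
```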
Finally, (100 \times eps)% of the data points are randomly replaced by outliers. These outliers are generated from a p-variate normal distribution as in Croux et al. (2013).
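The replacement mechanism can be illustrated with a short base R sketch. The outlier distribution used here, a normal distribution with an arbitrarily shifted mean of 25 and identity covariance, is a hypothetical placeholder; the actual parameters follow Croux et al. (2013) and are not given on this page. Rounding the number of outliers down with `floor` is likewise an assumption.

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
n <- 100; p <- 10; eps <- 0.2

# Stand-in for the clean data generated in the previous steps
X <- matrix(rnorm(n * p), n, p)

# Number of contaminated rows (rounding down is an assumption)
nOut <- floor(eps * n)

# Randomly chosen rows to replace
ind <- sort(sample.int(n, nOut))

# Hypothetical outlier distribution: N(25 * 1_p, I_p)
X[ind, ] <- matrix(rnorm(nOut * p, mean = 25), nOut, p)
```

Here `ind` plays the role of one element of the `ind` component returned by `dataGen`.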
The ith eigenvector of R, for i=1,...,k, is a sparse vector whose (bLength \times (i-1)+1)th through (bLength \times i)th elements are equal to 1/\sqrt{bLength} and whose other elements are all equal to zero.
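This eigenvector structure can be checked numerically with base R. The sketch below uses the default `a`, for which the leading eigenvalue of block i, 1 + (bLength - 1) \times a[i], is 3.7 for block 1 and 2.5 for block 2, so the ordering of the eigenvectors matches the ordering of the blocks. Eigenvectors are only determined up to sign, hence the comparison of absolute values.

```r
p <- 10; bLength <- 4; a <- c(0.9, 0.5, 0)
k <- length(a) - 1

# Block correlation matrix as described in the first step
grp <- c(rep(seq_len(k), each = bLength), rep(k + 1, p - k * bLength))
R <- outer(grp, grp, function(g, h) ifelse(g == h, a[g], 0))
diag(R) <- 1

E <- eigen(R, symmetric = TRUE)

# Compare each of the first k eigenvectors with its theoretical sparse form
ok <- logical(k)
for (i in seq_len(k)) {
  v <- numeric(p)
  v[(bLength * (i - 1) + 1):(bLength * i)] <- 1 / sqrt(bLength)
  ok[i] <- isTRUE(all.equal(abs(E$vectors[, i]), v))
}
ok
```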
See Hubert et al. (2016) for more details.

Value

A list with components:

`data`	List of length `m` containing all data matrices.
`ind`	List of length `m` containing the numeric vectors with the indices of the contaminated observations.
`R`	Correlation matrix of the data, a numeric matrix of size `p` by `p`.
`Sigma`	Covariance matrix of the data (`\Sigma`), a numeric matrix of size `p` by `p`.

Author(s)

Tom Reynkens

References

Hubert, M., Reynkens, T., Schmitt, E. and Verdonck, T. (2016). “Sparse PCA for High-Dimensional Data with Outliers,” Technometrics, 58, 424–434.

Croux, C., Filzmoser, P. and Fritz, H. (2013). “Robust Sparse Principal Component Analysis,” Technometrics, 55, 202–214.

Examples

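library(rospca)

# Generate one dataset with 20% contamination and take its data matrix
X <- dataGen(m=1, n=100, p=10, eps=0.2, bLength=4)$data[[1]]

# Fit robust PCA with 2 components and draw the diagnostic plot
resR <- robpca(X, k=2, skew=FALSE)
diagPlot(resR)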

[Package rospca version 1.1.0 Index]