dataGen {rospca}    R Documentation

Generate sparse data with outliers

Description

Generate sparse data with outliers using the simulation scheme detailed in Hubert et al. (2016).

Usage

dataGen(m = 100, n = 100, p = 10, a = c(0.9,0.5,0), bLength = 4, SD = c(10,5,2), 
        eps = 0, seed = TRUE)

Arguments

m

Number of datasets to generate, default is 100.

n

Number of observations, default is 100.

p

Number of dimensions, default is 10.

a

Numeric vector containing the within-group correlations for each block. The number of useful blocks is thus k = length(a) - 1, which should be at least 2. By default, the correlations are equal to 0.9, 0.5 and 0, respectively.

bLength

Length of the blocks of useful variables, default is 4.

SD

Numeric vector containing the standard deviations of the blocks of variables, default is c(10,5,2). Note that SD and a should have the same length.

eps

Proportion of contamination, should be between 0 and 0.5. Default is 0 (no contamination).

seed

Logical indicating if a seed is used when generating the datasets, default is TRUE.

Details

Firstly, we generate a correlation matrix R that has sparse eigenvectors. We design the correlation matrix to have length(a)=k+1 groups of variables, with no correlation between variables from different groups. The first k groups consist of bLength variables each. The correlation between distinct variables within group i is equal to a[i], for i=1,...,k. The (k+1)th group contains the remaining p-k \times bLength variables, which we specify to have correlation a[k+1].
Secondly, the correlation matrix R is transformed into the covariance matrix \Sigma= V^{0.5} \cdot R \cdot V^{0.5}, where V=diag(SD^2).
Thirdly, the n observations are generated from a p-variate normal distribution with mean equal to the p-variate zero vector and covariance matrix \Sigma. Standard normally distributed noise terms are also added to each of the p variables to make the sparse structure of the data harder to detect.
Finally, (100 \times eps)\% of the data points are randomly replaced by outliers. These outliers are generated from a p-variate normal distribution as in Croux et al. (2013).
The ith eigenvector of R, for i=1,...,k, is a sparse vector whose (bLength \times (i-1)+1)th through (bLength \times i)th elements are equal to 1/\sqrt{bLength} and whose other elements are all equal to zero.
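A short derivation may help to see why these sparse vectors are eigenvectors. The notation is introduced here for illustration only: b stands for bLength, R_i for the b by b diagonal block of R belonging to group i, a_i for a[i], and \mathbf{1}_b for the vector of b ones.

```latex
R_i = (1 - a_i)\, I_b + a_i\, \mathbf{1}_b \mathbf{1}_b^{\top},
\qquad
R_i \, \frac{\mathbf{1}_b}{\sqrt{b}}
  = \bigl[ (1 - a_i) + a_i\, b \bigr] \frac{\mathbf{1}_b}{\sqrt{b}}
  = \bigl[ 1 + (b - 1)\, a_i \bigr] \frac{\mathbf{1}_b}{\sqrt{b}}.
```

Because R is block diagonal, padding this vector with zeros outside group i gives an eigenvector of R with eigenvalue 1 + (bLength - 1) \times a[i]. With the default settings, the eigenvalues for the first two groups are 3.7 and 2.5.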
See Hubert et al. (2016) for more details.
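The first three steps described above can be sketched in base R as follows. This is an illustration of the description only, not the code used inside dataGen. In particular, the way SD is expanded to one standard deviation per variable (sdVec) and the way the noise is added are assumptions based on the text above. The contamination step is left out because the outlier distribution is not specified here.

```r
# Illustrative sketch only; this is not the rospca source code.
set.seed(1)

n <- 100; p <- 10; bLength <- 4
a  <- c(0.9, 0.5, 0)   # within-group correlations (defaults from Usage)
SD <- c(10, 5, 2)      # one standard deviation per group (defaults from Usage)
k  <- length(a) - 1    # number of useful blocks

# Step 1: block-diagonal correlation matrix.
# k groups of bLength variables, then one group with the remaining variables.
sizes <- c(rep(bLength, k), p - k * bLength)
group <- rep(seq_along(sizes), times = sizes)
R <- outer(group, group, function(g, h) ifelse(g == h, a[g], 0))
diag(R) <- 1

# Step 2: covariance matrix Sigma = V^{0.5} R V^{0.5}.
# Assumption: SD[g] is the standard deviation of every variable in group g.
sdVec <- SD[group]
Sigma <- diag(sdVec) %*% R %*% diag(sdVec)

# Step 3: n draws from N_p(0, Sigma) via the Cholesky factor,
# plus standard normal noise on each variable (assumed additive).
Z <- matrix(rnorm(n * p), nrow = n, ncol = p)
X <- Z %*% chol(Sigma) + matrix(rnorm(n * p), nrow = n, ncol = p)
```

Here group assigns each of the p variables to its block, so the same vector drives both the correlation pattern and the per-variable standard deviations.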

Value

A list with components:

data

List of length m containing all data matrices.

ind

List of length m containing the numeric vectors with the indices of the contaminated observations.

R

Correlation matrix of the data, a numeric matrix of size p by p.

Sigma

Covariance matrix of the data (\Sigma), a numeric matrix of size p by p.

Author(s)

Tom Reynkens

References

Hubert, M., Reynkens, T., Schmitt, E. and Verdonck, T. (2016). “Sparse PCA for High-Dimensional Data with Outliers,” Technometrics, 58, 424–434.

Croux, C., Filzmoser, P. and Fritz, H. (2013). “Robust Sparse Principal Component Analysis,” Technometrics, 55, 202–214.

Examples

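library(rospca)

# Generate one dataset with 20% contamination and extract its data matrix
X <- dataGen(m=1, n=100, p=10, eps=0.2, bLength=4)$data[[1]]

# Fit ROBPCA with 2 components and draw the diagnostic plot
resR <- robpca(X, k=2, skew=FALSE)
diagPlot(resR)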

[Package rospca version 1.1.0 Index]