dataGen {rospca} | R Documentation |
Generate sparse data with outliers
Description
Generate sparse data with outliers using the simulation scheme detailed in Hubert et al. (2016).
Usage
dataGen(m = 100, n = 100, p = 10, a = c(0.9,0.5,0), bLength = 4, SD = c(10,5,2),
eps = 0, seed = TRUE)
Arguments
m |
Number of datasets to generate, default is 100. |
n |
Number of observations, default is 100. |
p |
Number of dimensions, default is 10. |
a |
Numeric vector containing the inner group correlations for each block. The number of useful blocks is thus given by length(a)-1. Default is c(0.9,0.5,0). |
bLength |
Length of the blocks of useful variables, default is 4. |
SD |
Numeric vector containing the standard deviations of the blocks of variables, default is c(10,5,2). |
eps |
Proportion of contamination, should be between 0 and 0.5. Default is 0 (no contamination). |
seed |
Logical indicating if a seed is used when generating the datasets, default is TRUE. |
Details
Firstly, we generate a correlation matrix such that it has sparse eigenvectors. We design the correlation matrix to have length(a)=k+1 groups of variables with no correlation between variables from different groups. The first k groups consist of bLength variables each. The correlation between the different variables of a group is equal to a[1] for group 1, a[2] for group 2, and so on up to a[k] for group k. The (k+1)th group contains the remaining p-k \times bLength variables, which we specify to have correlation a[k+1].
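The block structure described above can be sketched in base R. This is only an illustration under stated assumptions, not the code used inside dataGen; the helper name make_block_corr is hypothetical.

```r
# Hypothetical helper (not part of rospca): build a block-diagonal
# correlation matrix from the within-group correlations `a`.
make_block_corr <- function(p = 10, a = c(0.9, 0.5, 0), bLength = 4) {
  k <- length(a) - 1                       # number of useful blocks
  stopifnot(p >= k * bLength)
  # Group label per variable: k blocks of size bLength, then the remainder
  group <- c(rep(seq_len(k), each = bLength), rep(k + 1, p - k * bLength))
  R <- matrix(0, p, p)
  for (g in seq_len(k + 1)) {
    idx <- which(group == g)
    R[idx, idx] <- a[g]                    # constant correlation inside group g
  }
  diag(R) <- 1                             # unit diagonal
  R
}

R <- make_block_corr()
R[1:4, 1:4]   # first block: off-diagonal entries equal to a[1]
```

With the default arguments, variables 1 to 4 form group 1, variables 5 to 8 form group 2, and variables 9 and 10 form the remaining group.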
Secondly, the correlation matrix R is transformed into the covariance matrix \Sigma= V^{0.5} \cdot R \cdot V^{0.5}, where V=diag(SD^2).
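As a small worked example of this transformation, the sketch below uses a toy 3 by 3 correlation matrix. The vector sdVec is a hypothetical name holding one standard deviation per variable; how dataGen expands the block-wise SD argument to individual variables is not shown here.

```r
# Toy correlation matrix (illustrative values only)
R <- matrix(c(1,   0.9, 0,
              0.9, 1,   0,
              0,   0,   1), nrow = 3, byrow = TRUE)
sdVec <- c(10, 10, 2)            # assumed: one standard deviation per variable
Vhalf <- diag(sdVec)             # V^{0.5}, since V = diag(sdVec^2)
Sigma <- Vhalf %*% R %*% Vhalf   # covariance matrix
Sigma
cov2cor(Sigma)                   # converting back recovers R
```

The diagonal of Sigma holds the variances sdVec^2, and each off-diagonal entry is the correlation scaled by the two standard deviations involved.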
Thirdly, the n observations are generated from a p-variate normal distribution with mean the p-variate zero-vector and covariance matrix \Sigma. Standard normally distributed noise terms are also added to each of the p variables to make the sparse structure of the data harder to detect.
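One way to carry out this sampling step in base R is through a Cholesky factor, as sketched below. This is an assumption about how such data can be drawn, not necessarily the sampler used by dataGen, and the small Sigma is illustrative.

```r
set.seed(1)
n <- 100
Sigma <- matrix(c(100, 90,  0,
                  90,  100, 0,
                  0,   0,   4), nrow = 3, byrow = TRUE)
p <- ncol(Sigma)
U <- chol(Sigma)                                # upper triangular, t(U) %*% U = Sigma
Z <- matrix(rnorm(n * p), nrow = n, ncol = p)   # i.i.d. N(0, 1) entries
X <- Z %*% U                                    # rows follow N_p(0, Sigma)
X <- X + matrix(rnorm(n * p), nrow = n, ncol = p)  # add standard normal noise
dim(X)
```

Because the noise is independent of the signal, the rows of the final X have covariance Sigma plus the identity matrix, which slightly blurs the block structure.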
Finally, (100 \times eps)\% of the data points are randomly replaced by outliers. These outliers are generated from a p-variate normal distribution as in Croux et al. (2013).
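The replacement step can be illustrated as follows. The outlier distribution used here (a spherical normal shifted to mean 25) is a placeholder chosen for the sketch and is not the setting of Croux et al. (2013); the use of floor() for the number of outliers is likewise an assumption.

```r
set.seed(2)
n <- 100; p <- 10; eps <- 0.2
X <- matrix(rnorm(n * p), nrow = n, ncol = p)   # stand-in for the clean data
nOut <- floor(eps * n)                          # assumed rounding rule
ind <- sort(sample.int(n, nOut))                # rows to be replaced
# Placeholder outliers: N_p(outMean, I), NOT the Croux et al. (2013) setting
outMean <- rep(25, p)
X[ind, ] <- matrix(rnorm(nOut * p), nrow = nOut, ncol = p) +
  matrix(outMean, nrow = nOut, ncol = p, byrow = TRUE)
length(ind)   # number of contaminated rows
```

Keeping the vector ind alongside the data makes it possible to check afterwards whether a robust method flags the contaminated rows.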
The ith eigenvector of R, for i=1,...,k, is given by a (sparse) vector with the (bLength \times (i-1)+1)th through the (bLength \times i)th elements equal to 1/\sqrt{bLength} and all other elements equal to zero.
See Hubert et al. (2016) for more details.
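The eigenvector claim can be checked numerically for the default settings. The sketch below rebuilds R by hand and relies on the standard result that a block with constant correlation a[i] has leading eigenvalue 1 + (bLength - 1) * a[i]; that formula is not stated in this help page and is used here only for the check.

```r
bLength <- 4; p <- 10; a <- c(0.9, 0.5, 0)
R <- matrix(0, p, p)
R[1:4, 1:4] <- a[1]                  # block 1
R[5:8, 5:8] <- a[2]                  # block 2 (remaining group has correlation 0)
diag(R) <- 1

i <- 1
v <- numeric(p)
v[(bLength * (i - 1) + 1):(bLength * i)] <- 1 / sqrt(bLength)  # sparse unit vector
lambda <- 1 + (bLength - 1) * a[i]   # eigenvalue associated with block i

all.equal(as.vector(R %*% v), lambda * v)   # checks that R v = lambda v
```

Setting i <- 2 gives the analogous check for the second block.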
Value
A list with components:
data |
List of length m containing the generated data matrices. |
ind |
List of length m. |
R |
Correlation matrix of the data, a numeric matrix of size p by p. |
Sigma |
Covariance matrix of the data (\Sigma), a numeric matrix of size p by p. |
Author(s)
Tom Reynkens
References
Hubert, M., Reynkens, T., Schmitt, E., and Verdonck, T. (2016), “Sparse PCA for High-Dimensional Data with Outliers,” Technometrics, 58, 424–434.
Croux, C., Filzmoser, P., and Fritz, H. (2013), “Robust Sparse Principal Component Analysis,” Technometrics, 55, 202–214.
Examples
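library(rospca)
# Generate one dataset with 20% contamination and extract its data matrix
X <- dataGen(m=1, n=100, p=10, eps=0.2, bLength=4)$data[[1]]
# Fit ROBPCA with k=2 components and draw the diagnostic plot
resR <- robpca(X, k=2, skew=FALSE)
diagPlot(resR)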