dataGen {rospca} | R Documentation |
Generate sparse data with outliers
Description
Generate sparse data with outliers using the simulation scheme detailed in Hubert et al. (2016).
Usage
dataGen(m = 100, n = 100, p = 10, a = c(0.9,0.5,0), bLength = 4, SD = c(10,5,2),
eps = 0, seed = TRUE)
Arguments
m |
Number of datasets to generate, default is 100. |
n |
Number of observations, default is 100. |
p |
Number of dimensions, default is 10. |
a |
Numeric vector containing the inner group correlations for each block. The number of useful blocks is thus given by k = length(a) - 1. Default is c(0.9, 0.5, 0). |
bLength |
Length of the blocks of useful variables, default is 4. |
SD |
Numeric vector containing the standard deviations of the blocks of variables, default is c(10, 5, 2). |
eps |
Proportion of contamination, should be between 0 and 0.5. Default is 0 (no contamination). |
seed |
Logical indicating if a seed is used when generating the datasets, default is TRUE. |
Details
Firstly, we generate a correlation matrix such that it has sparse eigenvectors.
We design the correlation matrix to have groups of variables with no correlation between variables from different groups. The first k = length(a) - 1 groups consist of bLength variables each. The correlation between the different variables of a group is equal to a[1] for group 1, a[2] for group 2, and so on up to a[k] for group k. The (k+1)th group contains the remaining p - k*bLength variables, which we specify to have correlation a[k+1].
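The block structure described above can be sketched in a few lines of base R. This is an illustrative reconstruction under the default arguments, not the package's own code; the helper names idx and rest are made up for the example.

```r
# Illustrative sketch only; not taken from the rospca sources.
p <- 10; bLength <- 4; a <- c(0.9, 0.5, 0)
k <- length(a) - 1                          # number of useful blocks

R <- diag(p)                                # start without any correlation
for (i in seq_len(k)) {
  idx <- ((i - 1) * bLength + 1):(i * bLength)
  R[idx, idx] <- a[i]                       # constant correlation inside block i
}
rest <- seq_len(p)[-seq_len(k * bLength)]   # remaining variables, group k+1
R[rest, rest] <- a[k + 1]
diag(R) <- 1                                # restore the unit diagonal
```

With the defaults this gives two 4 x 4 blocks with correlations 0.9 and 0.5, and two remaining variables that are uncorrelated with everything else.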
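Secondly, the correlation matrix R is transformed into the covariance matrix Sigma = V^0.5 R V^0.5, where V is the diagonal matrix holding the variance of each variable, i.e. the squared standard deviation that SD assigns to its block.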
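A correlation matrix is turned into a covariance matrix by scaling its rows and columns with the standard deviations. The sketch below does this for the default arguments; it assumes that SD[i] applies to every variable of block i, with the last entry used for the remaining variables. The names sizes, block and sdVec are hypothetical.

```r
# Illustrative sketch only; the mapping of SD to variables is an assumption.
p <- 10; bLength <- 4; a <- c(0.9, 0.5, 0); SD <- c(10, 5, 2)
k <- length(a) - 1
sizes <- c(rep(bLength, k), p - k * bLength)   # k useful blocks plus the rest
block <- rep(seq_along(sizes), times = sizes)  # block label of each variable

R <- outer(block, block, "==") * a[block]      # a[i] inside block i, 0 elsewhere
diag(R) <- 1

sdVec <- SD[block]                             # one standard deviation per variable
Sigma <- diag(sdVec) %*% R %*% diag(sdVec)     # scale rows and columns by the SDs
```

Applying cov2cor(Sigma) recovers R, which is a quick way to check the transformation.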
Thirdly, the n observations are generated from a p-variate normal distribution with mean the p-variate zero-vector and covariance matrix Sigma. Standard normally distributed noise terms are also added to each of the p variables to make the sparse structure of the data harder to detect.
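The sampling step can be sketched in base R with a Cholesky factor. This is one common way to draw from a multivariate normal distribution and is not necessarily the method used inside dataGen; the covariance matrix is rebuilt here for the default arguments so that the snippet runs on its own.

```r
# Illustrative sketch only; dataGen may draw its samples differently.
set.seed(1)                                   # only to make the sketch reproducible
n <- 100; p <- 10
block <- rep(1:3, times = c(4, 4, 2))         # default layout: 4 + 4 + 2 variables
R <- outer(block, block, "==") * c(0.9, 0.5, 0)[block]
diag(R) <- 1
sdVec <- c(10, 5, 2)[block]
Sigma <- diag(sdVec) %*% R %*% diag(sdVec)

# Draws from N_p(0, Sigma): Z %*% U has covariance t(U) %*% U = Sigma
U <- chol(Sigma)
X <- matrix(rnorm(n * p), n, p) %*% U

# Add independent standard normal noise to every variable
X <- X + matrix(rnorm(n * p), n, p)
```

Because the noise is independent of the signal, the rows of X then have covariance Sigma plus the identity matrix.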
Finally, a proportion eps of the data points is randomly replaced by outliers. These outliers are generated from a p-variate normal distribution as in Croux et al. (2013).
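The mechanics of the contamination step can be sketched as follows. The outlier distribution used here is a placeholder: the mean outMean, the unit variance and the rounding rule are assumptions made for illustration and are not the settings of Croux et al. (2013) or of dataGen.

```r
# Illustrative sketch only; outlier parameters below are hypothetical.
set.seed(2)
n <- 100; p <- 10; eps <- 0.2
X <- matrix(rnorm(n * p), n, p)          # stand-in for the clean data

nOut <- round(eps * n)                   # assumed rounding rule
ind <- sort(sample(n, nOut))             # rows that will be contaminated

outMean <- rep(25, p)                    # hypothetical outlier centre
X[ind, ] <- matrix(rnorm(nOut * p), nOut, p) + rep(outMean, each = nOut)
```

Keeping the vector ind alongside the data makes it possible to check afterwards which observations a robust method flags correctly.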
The ith eigenvector of R, for i = 1, ..., k, is given by a (sparse) vector with the (bLength*(i-1)+1)th till the (bLength*i)th elements equal to 1/sqrt(bLength) and all other elements equal to zero.
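The sparse eigenvectors can be checked numerically for the default settings. A block with constant within-block correlation a[i] has leading eigenvalue 1 + (bLength - 1) * a[i], and the corresponding eigenvector is constant on that block and zero elsewhere. The sketch rebuilds the block correlation matrix itself and uses only base R.

```r
# Illustrative check only; uses the default a = c(0.9, 0.5, 0) and bLength = 4.
bLength <- 4
block <- rep(1:3, times = c(4, 4, 2))
R <- outer(block, block, "==") * c(0.9, 0.5, 0)[block]
diag(R) <- 1

e <- eigen(R, symmetric = TRUE)
round(e$values[1:2], 3)            # equals 1 + (bLength - 1) * a[i] for i = 1, 2
round(abs(e$vectors[, 1:2]), 3)    # absolute values, since the sign is arbitrary
```

The first column is nonzero only in positions 1 to 4 and the second only in positions 5 to 8, with entries 1/sqrt(4) = 0.5 in absolute value.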
See Hubert et al. (2016) for more details.
Value
A list with components:
data |
List of length m containing the generated data matrices. |
ind |
List of length m containing the numeric vectors with the indices of the contaminated observations. |
R |
Correlation matrix of the data, a numeric matrix of size p by p. |
Sigma |
Covariance matrix of the data (Sigma), a numeric matrix of size p by p. |
Author(s)
Tom Reynkens
References
Hubert, M., Reynkens, T., Schmitt, E. and Verdonck, T. (2016). “Sparse PCA for High-Dimensional Data with Outliers,” Technometrics, 58, 424–434.
Croux, C., Filzmoser, P. and Fritz, H. (2013). “Robust Sparse Principal Component Analysis,” Technometrics, 55, 202–214.
Examples
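# Generate a single dataset in which 20% of the observations are outliers
X <- dataGen(m=1, n=100, p=10, eps=0.2, bLength=4)$data[[1]]
# Robust PCA with two components, followed by the diagnostic plot
resR <- robpca(X, k=2, skew=FALSE)
diagPlot(resR)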