rospca {rospca}    R Documentation
RObust Sparse PCA algorithm
Description
Sparse robust PCA algorithm based on the ROBPCA algorithm of Hubert et al. (2005).
Usage
rospca(X, k, kmax = 10, alpha = 0.75, h = NULL, ndir = "all", grid = TRUE,
lambda = 10^(-6), sparse = "varnum", para, stand = TRUE, skew = FALSE)
Arguments
X: An n by p data matrix or data frame with observations in the rows and variables in the columns.
k: Number of principal components that will be used.
kmax: Maximal number of principal components that will be computed, default is 10.
alpha: Robustness parameter, default is 0.75.
h: The number of outliers the algorithm should resist is given by n - h. Default is NULL, in which case h is determined by alpha.
ndir: Number of directions used when computing the outlyingness (or the adjusted outlyingness when skew = TRUE), see outlyingness and adjOutl for more details. Default is "all".
grid: Logical indicating if the grid version of sparse PCA (sPCAgrid) should be used; if FALSE, the SPCA algorithm of Zou et al. (2006) (spca) is used instead. Default is TRUE.
lambda: Sparsity parameter of sPCAgrid (when grid = TRUE) or ridge parameter of spca (when grid = FALSE), default is 10^(-6).
sparse: Parameter for spca (only used when grid = FALSE), see spca for more details. Default is "varnum".
para: Parameter for spca (only used when grid = FALSE), see spca for more details.
stand: Logical indicating if the data should be standardised (see Details); if FALSE, the data are only centred, not scaled. Default is TRUE.
skew: Logical indicating if the version for skewed data should be applied, default is FALSE.
Details
The ROSPCA algorithm consists of an outlier detection part (step 1), and a sparsification part (steps 2 and 3). We give an overview of these steps here and refer to Hubert et al. (2016) for more details.
Step 1: This is a robustness step similar to ROBPCA. When a standardisation is appropriate, the variables are first robustly standardised by means of the componentwise median and the Q_n. Using the singular value decomposition (SVD) of the resulting data matrix, the p-dimensional data space is reduced to the affine subspace spanned by the n observations. Then, the subset of the h observations with smallest outlyingness is selected (H_0). Thereafter, a reweighting step is applied: given the orthogonal distances to the preliminary PCA subspace determined by the observations in H_0, all observations with orthogonal distances (ODs) smaller than the corresponding cut-off are kept (H_1).
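As a rough illustration of the robust standardisation used in step 1, the componentwise median and Q_n scale can be computed with the robustbase package. This is only a sketch of the idea, with hypothetical names, and not the internal code of rospca:

library(robustbase)
# Robustly standardise each variable by its median and Qn scale estimate
robStand <- function(X) {
  med <- apply(X, 2, median)   # componentwise median (robust centre)
  sc  <- apply(X, 2, Qn)       # componentwise Qn (robust scale)
  sweep(sweep(X, 2, med, "-"), 2, sc, "/")
}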
Step 2: First, the data points with indices in H_1 are standardised using the componentwise median and the Q_n, and sparse PCA is applied to them. Then, an additional reweighting step is performed which incorporates information about the sparse structure of the data. Variables with zero loadings on all k PCs are discarded, and the orthogonal distances to the estimated sparse PCA subspace are computed. This yields an index set H_2 of observations with orthogonal distance smaller than the cut-off corresponding to these new orthogonal distances. Thereafter, the subset of observations with indices in H_2 is standardised using the componentwise median and the Q_n of the observations in H_1 (the same standardisation as the first time sparse PCA is applied), and sparse PCA is applied to them, which gives the sparse loadings. Adding the discarded zero loadings again gives the loadings matrix P_2.
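For intuition, the orthogonal distance of an observation to a PCA subspace with loadings P and centre mu is the norm of its projection residual. A minimal sketch with hypothetical names (not the internal code of rospca), assuming P has orthonormal columns:

# Orthogonal distances to the subspace spanned by the columns of P (p x k),
# with centre mu (length p)
orthDist <- function(X, P, mu) {
  Xc  <- sweep(X, 2, mu, "-")        # centre the data
  res <- Xc - Xc %*% P %*% t(P)      # residuals after projecting onto the subspace
  sqrt(rowSums(res^2))               # Euclidean norm of each residual
}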
Step 3: In the last step, the eigenvalues are estimated robustly by applying the Q_n^2 estimator to the scores of the observations with indices in H_2. In order to robustly estimate the centre, the score distances are computed and all observations of H_2 with a score distance smaller than the corresponding cut-off are retained; this is the set H_3. The centre is then estimated by the mean of these observations. Finally, the estimates of the eigenvalues are recomputed as the sample variance of the (new) scores of the observations with indices in H_3.
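As a sketch of this eigenvalue estimation (hypothetical object scoresH2 holding the scores of the observations in H_2; the 0.975 chi-squared cut-off is the usual ROBPCA convention and is an assumption here):

library(robustbase)
eigH2 <- apply(scoresH2, 2, Qn)^2                        # squared Qn per score column
sdH2  <- sqrt(rowSums(sweep(scoresH2^2, 2, eigH2, "/"))) # score distances
cutSD <- sqrt(qchisq(0.975, ncol(scoresH2)))             # assumed cut-off for the SDs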
The eigenvalues are sorted in descending order, so the order of the PCs may change. The columns of the loadings and scores matrices are changed accordingly.
Note that when it is not necessary to standardise the data, they are only centred as in the scheme above, but not scaled.
In contrast to Hubert et al. (2016), we allow for SPCA (Zou et al., 2006) to be used as the sparse PCA method inside ROSPCA (grid = FALSE). Moreover, we also include a skew-adjusted version of ROSPCA (skew = TRUE), similar to the skew-adjusted version of ROBPCA (Hubert et al., 2009). This adjusted version is not detailed in Hubert et al. (2016).
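For example, these alternatives can be selected through the corresponding arguments (illustrative parameter values only):

# SPCA of Zou et al. (2006) as the sparse PCA method, with (illustratively)
# 4 variables having non-zero loadings per component
resZou <- rospca(X, k=2, grid=FALSE, sparse="varnum", para=c(4, 4))
# Skew-adjusted version of ROSPCA
resSkew <- rospca(X, k=2, lambda=0.4, skew=TRUE)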
Value
A list with components:
loadings: Loadings matrix containing the sparse robust loadings (eigenvectors), a numeric matrix of size p by k.
eigenvalues: Numeric vector of length k containing the robust eigenvalues.
scores: Scores matrix (computed as (X - center) %*% loadings), a numeric matrix of size n by k.
center: Numeric vector of length p containing the centre of the data.
D: Matrix used to standardise the data before applying sparse PCA (the identity matrix if stand = FALSE), a numeric matrix of size p by p.
k: Number of (chosen) principal components.
H0: Logical vector of size n indicating if an observation is in the initial h-subset H_0 (step 1).
H1: Logical vector of size n indicating if an observation is kept in the reweighting step of step 1 (the set H_1).
P1: Loadings matrix before applying the sparse reweighting step, a numeric matrix of size p by k.
index: Numeric vector containing the indices of the variables that are used in the sparse reweighting step.
H2: Logical vector of size n indicating if an observation is kept in the sparse reweighting step of step 2 (the set H_2).
P2: Loadings matrix before estimating the eigenvalues, a numeric matrix of size p by k.
H3: Logical vector of size n indicating if an observation is kept in the final reweighting step based on the score distances (the set H_3).
alpha: The robustness parameter alpha that was used.
h: The value of h that was used (the size of H_0).
sd: Numeric vector of size n containing the robust score distances within the robust PCA subspace.
od: Numeric vector of size n containing the orthogonal distances to the robust PCA subspace.
cutoff.sd: Cut-off value for the robust score distances.
cutoff.od: Cut-off value for the orthogonal distances.
flag.sd: Numeric vector of size n containing the SD-flags of the observations: observations whose score distance exceeds cutoff.sd receive an SD-flag equal to 0, the other observations receive an SD-flag equal to 1.
flag.od: Numeric vector of size n containing the OD-flags of the observations: observations whose orthogonal distance exceeds cutoff.od receive an OD-flag equal to 0, the other observations receive an OD-flag equal to 1.
flag.all: Numeric vector of size n containing the overall flags of the observations: observations whose score distance exceeds cutoff.sd or whose orthogonal distance exceeds cutoff.od can be considered as outliers and receive a flag equal to 0, the regular observations receive a flag equal to 1.
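For instance, the flag vector can be used to list the observations that are flagged as outlying (using a fitted object such as resRS from the Examples below):

# Indices of observations flagged as outliers (SD or OD above its cut-off)
which(resRS$flag.all == 0)
# Robust eigenvalues and sparse loadings
resRS$eigenvalues
resRS$loadings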
Author(s)
Tom Reynkens, based on R code from Valentin Todorov for PcaHubert in rrcov (released under GPL-3) and Matlab code from Katrien Van Driessen (for the univariate MCD).
References
Hubert, M., Reynkens, T., Schmitt, E., and Verdonck, T. (2016), “Sparse PCA for High-Dimensional Data with Outliers,” Technometrics, 58, 424–434.
Hubert, M., Rousseeuw, P. J., and Vanden Branden, K. (2005), “ROBPCA: A New Approach to Robust Principal Component Analysis,” Technometrics, 47, 64–79.
Hubert, M., Rousseeuw, P. J., and Verdonck, T. (2009), “Robust PCA for Skewed Data and Its Outlier Map,” Computational Statistics & Data Analysis, 53, 2264–2274.
Croux, C., Filzmoser, P., and Fritz, H. (2013), “Robust Sparse Principal Component Analysis,” Technometrics, 55, 202–214.
Zou, H., Hastie, T., and Tibshirani, R. (2006), “Sparse Principal Component Analysis,” Journal of Computational and Graphical Statistics, 15, 265–286.
See Also
PcaHubert, robpca, outlyingness, adjOutl, sPCAgrid, spca
Examples
# Generate a simulated data set with 100 observations and 10 variables
X <- dataGen(m=1, n=100, p=10, eps=0.2, bLength=4)$data[[1]]
# Robust sparse PCA with 2 components and sparsity parameter lambda = 0.4
resRS <- rospca(X, k=2, lambda=0.4, stand=TRUE)
# Diagnostic plot of the result
diagPlot(resRS)