MacroPCA {cellWise} R Documentation

MacroPCA

Description

This function performs the MacroPCA algorithm, which can deal with Missing values and Cellwise and Rowwise Outliers. Note that this function first calls `checkDataSet` and analyzes the remaining cleaned data.

Usage

```
MacroPCA(X, k = 0, MacroPCApars = NULL)
```

Arguments

`X`  `X` is the input data, and must be an n by d matrix or a data frame.

`k`  `k` is the desired number of principal components. If `k = 0` or `k = NULL`, the algorithm computes the percentage of explained variability for `k` up to `kmax`, shows a scree plot, and suggests choosing a value of `k` such that the cumulative percentage of explained variability is at least 80%.

`MacroPCApars`  A list of available options, detailed below (`DDCpars` through `bigOutput`). If `MacroPCApars = NULL` the defaults below are used.

`DDCpars`  A list with parameters for the first step of the MacroPCA algorithm (for the complete list see the function `DDC`). Default is `NULL`.

`kmax`  The maximal number of principal components to compute. Default is `kmax = 10`. If `k` is provided, `kmax` does not need to be specified, unless `k` is larger than 10, in which case you need to set `kmax` high enough.

`alpha`  This is the coverage, i.e. the fraction of rows the algorithm should give full weight. `alpha` should be between 0.50 and 1; the default is 0.50.

`scale`  A value indicating whether and how the original variables should be scaled. If `scale = FALSE` or `scale = NULL` no scaling is performed (and a vector of 1s is returned in the `$scaleX` slot). If `scale = TRUE` (default) the data are scaled by a 1-step M-estimator of scale with the Tukey biweight weight function to have a robust scale of 1. Alternatively, `scale` can be a vector of length equal to the number of columns of `X`. The resulting scale estimates are returned in the `$scaleX` slot of the MacroPCA output.

`maxdir`  The maximal number of random directions to use for computing the outlyingness of the data points. Default is `maxdir = 250`. If the number n of observations is small, all n * (n - 1) / 2 pairs of observations are used.

`distprob`  The quantile determining the cutoff values for orthogonal and score distances. Default is 0.99.

`silent`  If `TRUE`, statements tracking the algorithm's progress will not be printed. Defaults to `FALSE`.

`maxiter`  Maximum number of iterations. Default is 20.

`tol`  Tolerance for iterations. Default is 0.005.

`bigOutput`  Whether to compute and return `NAimp`, `Cellimp` and `Fullimp`. Defaults to `TRUE`.
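Two points from the argument descriptions can be sketched in code: the 80% rule for suggesting `k`, and how an options list is passed. This is a minimal sketch, not the package's internal computation. The eigenvalues are made up for illustration, the names `eig`, `cumExplained`, `kSuggested`, `pars`, `Xsim` and `fit` are hypothetical, and the sketch assumes that options omitted from `MacroPCApars` fall back to the defaults listed above.

```r
# Hypothetical eigenvalues, invented purely to illustrate the 80% rule.
eig <- c(5.2, 2.1, 1.0, 0.6, 0.4, 0.3, 0.2, 0.1, 0.06, 0.04)

# Cumulative percentage of explained variability for k = 1, ..., kmax.
cumExplained <- 100 * cumsum(eig) / sum(eig)

# Smallest k whose cumulative percentage reaches 80%.
kSuggested <- which(cumExplained >= 80)[1]

# An options list using only names documented above.
# Assumption: entries that are left out keep their default values.
pars <- list(
  kmax     = 5,     # compute at most 5 components
  alpha    = 0.75,  # coverage, must lie between 0.50 and 1
  distprob = 0.99,  # quantile for the distance cutoffs
  silent   = TRUE   # suppress progress messages
)

# Only run the fit when the cellWise package is installed.
if (requireNamespace("cellWise", quietly = TRUE)) {
  set.seed(1)
  # Simulated correlated data: one common factor plus noise.
  Xsim <- outer(rnorm(60), rep(1, 6)) + 0.5 * matrix(rnorm(360), 60, 6)
  fit <- cellWise::MacroPCA(Xsim, k = 2, MacroPCApars = pars)
  print(fit$k)
}
```

Passing `k` explicitly, as here, skips the scree-plot suggestion step described for `k = 0`.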

Value

A list with components:

`MacroPCApars`  the options used in the call.

`remX`  cleaned data after `checkDataSet`.

`DDC`  results of the first step of MacroPCA. These are needed to run `MacroPCApredict` on new data.

`scaleX`  the scales of the columns of `X`.

`k`  the number of principal components.

`loadings`  the columns are the `k` loading vectors.

`eigenvalues`  the `k` eigenvalues.

`center`  vector with the fitted center.

`alpha`  `alpha` from the input.

`h`  `h` (computed from `alpha`).

`It`  number of iteration steps.

`diff`  convergence criterion.

`X.NAimp`  data with all `NA`s imputed by `MacroPCA`.

`scores`  scores of `X.NAimp`.

`OD`  orthogonal distances of the rows of `X.NAimp`.

`cutoffOD`  cutoff value for the OD.

`SD`  score distances of the rows of `X.NAimp`.

`cutoffSD`  cutoff value for the SD.

`indrows`  row numbers of rowwise outliers.

`residScale`  scale of the residuals.

`stdResid`  standardized residuals. Note that these are `NA` for all missing values of `X`.

`indcells`  indices of cellwise outliers.

`NAimp`  various results for the NA-imputed data.

`Cellimp`  various results for the cell-imputed data.

`Fullimp`  various results for the fully imputed data.
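As a rough illustration of how the returned components might be inspected, the sketch below fits MacroPCA to simulated data and compares the distances with their cutoffs. The data and the names `X`, `idx`, `fit`, `flaggedOD` and `flaggedSD` are hypothetical. Comparing `OD` and `SD` against `cutoffOD` and `cutoffSD` is one plausible way to read the output, not necessarily how `indrows` is derived internally.

```r
set.seed(2024)
n <- 60; d <- 6

# Simulated correlated data: one common factor plus noise.
X <- outer(rnorm(n), rep(1, d)) + 0.5 * matrix(rnorm(n * d), n, d)

# Choose 40 distinct cells: 20 become missing, 20 become outlying.
idx <- sample(n * d, 40)
X[idx[1:20]]  <- NA
X[idx[21:40]] <- 10

# Only run the fit when the cellWise package is installed.
if (requireNamespace("cellWise", quietly = TRUE)) {
  fit <- cellWise::MacroPCA(X, k = 2, MacroPCApars = list(silent = TRUE))

  # Loading matrix: one column per retained component.
  print(dim(fit$loadings))

  # Rows whose orthogonal or score distance exceeds its cutoff.
  flaggedOD <- which(fit$OD > fit$cutoffOD)
  flaggedSD <- which(fit$SD > fit$cutoffSD)
  print(flaggedOD)
  print(flaggedSD)

  # Rows reported as rowwise outliers, and cells reported as cellwise outliers.
  print(fit$indrows)
  print(length(fit$indcells))

  # Standardized residuals are NA where X was missing.
  print(sum(is.na(fit$stdResid)))
}
```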

Author(s)

Rousseeuw P.J., Van den Bossche W.

References

Hubert, M., Rousseeuw, P.J., Van den Bossche W. (2019). MacroPCA: An all-in-one PCA method allowing for missing values as well as cellwise and rowwise outliers. Technometrics, 61(4), 459-473.

See Also

`checkDataSet`, `cellMap`, `DDC`

Examples

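```
library(MASS)      # for mvrnorm()
library(cellWise)  # for MacroPCA() and cellMap()
set.seed(12345)
n <- 50; d <- 10
# Correlation matrix: 0.9 off the diagonal, 1 on the diagonal
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)
# Make 50 randomly chosen cells missing, then set 50 cells to the outlying value 10
x[sample(1:(n * d), 50, FALSE)] <- NA
x[sample(1:(n * d), 50, FALSE)] <- 10
# Prepend a column of row numbers, which checkDataSet should discard
x <- cbind(1:n, x)
MacroPCA.out <- MacroPCA(x, 2)
# Draw the cell map of the standardized residuals
cellMap(MacroPCA.out$remX, MacroPCA.out$stdResid,
        columnlabels = 1:d, rowlabels = 1:n)
# For more examples, we refer to the vignette:
vignette("MacroPCA_examples")
```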

[Package cellWise version 2.2.5 Index]