MacroPCA {cellWise}    R Documentation

MacroPCA

Description

This function performs the MacroPCA algorithm, which can deal with Missing values and Cellwise and Rowwise Outliers. Note that this function first calls checkDataSet and analyzes the remaining cleaned data.
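As a rough illustration of the cleaning step mentioned above, the sketch below calls checkDataSet directly on simulated data to which a constant column has been appended. The data, the object names x_raw and clean, and the choice silent = TRUE are assumptions made for this example. MacroPCA performs this step internally, so calling checkDataSet yourself is not required.

```r
library(cellWise)
library(MASS)

set.seed(1)
n <- 50; d <- 10
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)

# Hypothetical raw data: append a constant column, which carries no information.
x_raw <- cbind(x, rep(5, n))

# checkDataSet drops unsuitable columns and rows; the cleaned data are in $remX.
clean <- checkDataSet(x_raw, silent = TRUE)
dim(x_raw)
dim(clean$remX)
```

Comparing the two dimensions shows which part of the input would actually be analyzed by MacroPCA.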

Usage

MacroPCA(X, k = 0, MacroPCApars = NULL)

Arguments

X

X is the input data, and must be an n by d matrix or a data frame. It must always be provided.

k

k is the desired number of principal components. If k = 0 or k = NULL, the algorithm computes the percentage of explained variability for each k up to kmax, shows a scree plot, and suggests choosing a value of k such that the cumulative percentage of explained variability is at least 80%.

MacroPCApars

A list of available options detailed below. If MacroPCApars = NULL the defaults below are used.

  • DDCpars
    A list with parameters for the first step of the MacroPCA algorithm (for the complete list see the function DDC). Default is NULL.

  • kmax
    The maximal number of principal components to compute. Default is kmax = 10. If k is provided, kmax does not need to be specified, unless k is larger than 10, in which case kmax must be set high enough.

  • alpha
    This is the coverage, i.e. the fraction of rows the algorithm should give full weight. alpha should be between 0.50 and 1; the default is 0.50.

  • scale
    A value indicating whether and how the original variables should be scaled. If scale = FALSE or scale = NULL, no scaling is performed (and a vector of 1s is returned in the $scaleX slot). If scale = TRUE (default), the data are scaled by a 1-step M-estimator of scale with the Tukey biweight weight function to have a robust scale of 1. Alternatively, scale can be a vector of length equal to the number of columns of X. The resulting scale estimates are returned in the $scaleX slot of the MacroPCA output.

  • maxdir
    The maximal number of random directions to use for computing the outlyingness of the data points. Default is maxdir = 250. If the number n of observations is small, all n * (n - 1) / 2 pairs of observations are used.

  • distprob
    The quantile determining the cutoff values for orthogonal and score distances. Default is 0.99.

  • silent
    If TRUE, statements tracking the algorithm's progress will not be printed. Defaults to FALSE.

  • maxiter
    Maximum number of iterations. Default is 20.

  • tol
    Tolerance for iterations. Default is 0.005.

  • center
    If NULL, MacroPCA will compute the center. If center is a vector with d components, that vector is used as the center.

  • bigOutput
    Whether to compute and return NAimp, Cellimp and Fullimp. Defaults to TRUE.
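The options above are passed as a named list. The sketch below shows one possible configuration on simulated data. The values chosen for alpha, distprob and maxiter are arbitrary assumptions for illustration, not recommendations, and the object names pars and fit are hypothetical. Options omitted from the list are expected to fall back to the defaults described above.

```r
library(cellWise)
library(MASS)

set.seed(12345)
n <- 50; d <- 10
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)
x[sample(1:(n * d), 50, FALSE)] <- NA
x[sample(1:(n * d), 50, FALSE)] <- 10

# Illustrative option values (assumptions, not recommendations).
pars <- list(
  alpha    = 0.75,   # give full weight to 75% of the rows
  distprob = 0.975,  # quantile for the OD and SD cutoffs
  maxiter  = 50,     # allow more iterations than the default of 20
  silent   = TRUE    # suppress progress messages
)

fit <- MacroPCA(x, k = 2, MacroPCApars = pars)
fit$k    # number of principal components used
fit$It   # number of iteration steps taken
```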

Value

A list with components:

MacroPCApars

the options used in the call.

remX

Cleaned data after checkDataSet.

DDC

results of the first step of MacroPCA. These are needed to run MacroPCApredict on new data.

scaleX

the scales of the columns of X. When scale = FALSE these are all 1.

k

the number of principal components.

loadings

the columns are the k loading vectors.

eigenvalues

the k eigenvalues.

center

vector with the center.

alpha

alpha from the input.

h

h (computed from alpha).

It

number of iteration steps.

diff

convergence criterion.

X.NAimp

data with all NA's imputed by MacroPCA.

scores

scores of X.NAimp.

OD

orthogonal distances of the rows of X.NAimp.

cutoffOD

cutoff value for the OD.

SD

score distances of the rows of X.NAimp.

cutoffSD

cutoff value for the SD.

highOD

row numbers of cases whose OD is above cutoffOD.

highSD

row numbers of cases whose SD is above cutoffSD.

residScale

scale of the residuals.

stdResid

standardized residuals. Note that these are NA for all missing values of X.

indcells

indices of cellwise outliers.

NAimp

various results for the NA-imputed data.

Cellimp

various results for the cell-imputed data.

Fullimp

various results for the fully imputed data.
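A short sketch of how some of these components might be inspected after a fit. The simulated data and the object names fit and flagged_rows are assumptions for this example; only components listed above are accessed.

```r
library(cellWise)
library(MASS)

set.seed(12345)
n <- 50; d <- 10
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)
x[sample(1:(n * d), 50, FALSE)] <- NA
x[sample(1:(n * d), 50, FALSE)] <- 10

fit <- MacroPCA(x, k = 2, MacroPCApars = list(silent = TRUE))

dim(fit$loadings)   # the columns are the k loading vectors
fit$eigenvalues     # the k eigenvalues

# Rows whose orthogonal distance or score distance exceeds its cutoff.
flagged_rows <- union(fit$highOD, fit$highSD)
flagged_rows

# Number of cells flagged as cellwise outliers.
length(fit$indcells)

# Standardized residuals are NA wherever X was missing.
sum(is.na(fit$stdResid))
```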

Author(s)

Rousseeuw P.J., Van den Bossche W.

References

Hubert, M., Rousseeuw, P.J., Van den Bossche, W. (2019). MacroPCA: An all-in-one PCA method allowing for missing values as well as cellwise and rowwise outliers. Technometrics, 61(4), 459-473.

See Also

checkDataSet, cellMap, DDC

Examples

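library(cellWise)
library(MASS)
set.seed(12345)
n <- 50; d <- 10
# Covariance matrix with unit variances and all correlations equal to 0.9.
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)
# Set 50 randomly chosen cells to NA, then set 50 randomly chosen cells to 10.
x[sample(1:(n * d), 50, FALSE)] <- NA
x[sample(1:(n * d), 50, FALSE)] <- 10
# Fit MacroPCA with k = 2 components and draw a cell map of the standardized residuals.
MacroPCA.out <- MacroPCA(x, 2)
cellMap(MacroPCA.out$stdResid)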

# For more examples, we refer to the vignette:
## Not run: 
vignette("MacroPCA_examples")

## End(Not run)
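The DDC component of the output is what allows a fitted model to be applied to new data with MacroPCApredict. The following is a minimal sketch, assuming the new data have the same columns as the data used for the initial fit. The simulated test set xnew and the object names fit and pred are hypothetical.

```r
library(cellWise)
library(MASS)

set.seed(12345)
n <- 50; d <- 10
A <- matrix(0.9, d, d); diag(A) <- 1

# Training data with missing values and outlying cells.
x <- mvrnorm(n, rep(0, d), A)
x[sample(1:(n * d), 50, FALSE)] <- NA
x[sample(1:(n * d), 50, FALSE)] <- 10

# Hypothetical new data with a few outlying cells.
xnew <- mvrnorm(25, rep(0, d), A)
xnew[sample(1:(25 * d), 12, FALSE)] <- 10

# Fit on the training data, then apply the fitted model to the new rows.
fit  <- MacroPCA(x, k = 2, MacroPCApars = list(silent = TRUE))
pred <- MacroPCApredict(xnew, fit)

# Cell map of the standardized residuals of the new data.
cellMap(pred$stdResid)
```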

[Package cellWise version 2.5.3 Index]