pcasvd {rchemo}    R Documentation

PCA algorithms

Description

Algorithms for fitting a centered and weighted PCA of a matrix X.

Let D be a (n, n) diagonal matrix of weights for the observations (rows of X). The functions consist of:

- pcasvd: SVD factorization of D^(1/2) * X, using the function svd (a numerical sketch of this factorization is given after this list).

- pcaeigen: Eigen factorization of X' * D * X, using the function eigen.

- pcaeigenk: Eigen factorization of D^(1/2) * X * X' * D^(1/2), using the function eigen. This is the "kernel cross-product trick" version of the PCA algorithm (Wu et al. 1997). For wide matrices (n << p) and n not too large, this algorithm can be much faster than the others.

- pcanipals: Eigen factorization of X' * D * X using NIPALS.

- pcanipalsna: Eigen factorization of X' * D * X using NIPALS allowing missing data in X.

- pcasph: Robust spherical PCA (Locantore et al. 1999, Maronna 2005, Daszykowski et al. 2007).
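The sketch below illustrates the SVD factorization on simulated data. It is not package code: the names d, Xc, and the way the scores are recovered are assumptions made for the example, and the results are expected to match pcasvd only up to the signs of the columns.

n <- 10 ; p <- 4
X <- matrix(rnorm(n * p), nrow = n)
d <- rep(1 / n, n)                               # normalized weights = diagonal of D
xmeans <- colSums(d * X)                         # weighted column means
Xc <- scale(X, center = xmeans, scale = FALSE)   # centered X
u <- svd(sqrt(d) * Xc)                           # SVD of D^(1/2) * X
P <- u$v[, 1:2]                                  # loadings (p, nlv)
T <- Xc %*% P                                    # scores (n, nlv)
sv <- u$d[1:2]                                   # singular values; eigenvalues = sv^2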

Function pcanipalsna accepts missing data (NAs) in X, unlike the other functions. The part of pcanipalsna accounting specifically for missing data is based on the efficient code of K. Wright in the R package nipals (https://cran.r-project.org/web/packages/nipals/index.html).

Gram-Schmidt orthogonalization in the NIPALS algorithm

The NIPALS algorithm is known to generate a loss of orthogonality of the PCs (due to the accumulation of rounding errors over the successive iterations), particularly for large matrices or matrices with a high degree of column collinearity.

With missing data, the orthogonality of the loadings is not satisfied either.

An approach to restore orthogonality (of the PCs and loadings) is the iterative classical Gram-Schmidt orthogonalization (Lingen 2000, Andrecut 2009, and the vignette of the R package nipals), referred to as the iterative CGS. It consists of adding a CGS orthogonalization step at each iteration of the PC and loading calculations.

In the case of missing data, the iterative CGS does not ensure that the orthogonalized PCs are centered.
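For illustration only, one classical Gram-Schmidt step applied to a current score (or loading) vector t can be written as below. The function cgs is hypothetical (a minimal sketch of the principle, not the code used in pcanipals); in the iterative CGS, such a step is applied to the score and loading vectors at each NIPALS iteration.

## Hypothetical helper: deflate the current vector t by its projections
## on the columns of T (the vectors already computed)
cgs <- function(t, T) {
    for (j in seq_len(ncol(T)))
        t <- t - sum(T[, j] * t) / sum(T[, j]^2) * T[, j]
    t
}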

Auxiliary functions

transform: Calculates the PCs for any new matrix X from the model.

summary: Returns summary information for the model.

Usage


pcasvd(X, weights = NULL, nlv)

pcaeigen(X, weights = NULL, nlv)

pcaeigenk(X, weights = NULL, nlv)

pcanipals(X, weights = NULL, nlv,
    gs = TRUE, 
    tol = .Machine$double.eps^0.5, maxit = 200)

pcanipalsna(X, nlv, 
    gs = TRUE,
    tol = .Machine$double.eps^0.5, maxit = 200)
    
pcasph(X, weights = NULL, nlv)
  
## S3 method for class 'Pca'
transform(object, X, ..., nlv = NULL)  

## S3 method for class 'Pca'
summary(object, X, ...)  
  

Arguments

X

For the main functions and the auxiliary function summary: training X-data (n, p). For the auxiliary function transform: new X-data (m, p) to consider.

weights

Weights (n, 1) to apply to the training observations. Internally, the weights are "normalized" to sum to 1. Defaults to NULL (the weights are then set to 1 / n).

nlv

The number of PCs to calculate.

object

A fitted model, output of a call to one of the main functions.

...

Optional arguments.

Specific for the NIPALS algorithm

gs

Logical indicating whether a Gram-Schmidt orthogonalization is performed (default to TRUE).

tol

Tolerance for testing convergence of the NIPALS iterations for each PC.

maxit

Maximum number of NIPALS iterations for each PC.

Value

A list of outputs, such as:

T

The score matrix (n, nlv).

P

The loadings matrix (p, nlv).

R

The projection matrix (= P) (p, nlv).

sv

The singular values (min(n, p), 1), except for the NIPALS functions where the dimension is (nlv, 1).

eig

The eigenvalues (= sv^2) (min(n, p), 1), except for the NIPALS functions where the dimension is (nlv, 1). A quick check of this relation is sketched after this list.

xmeans

The centering vector of X (p, 1).

niter

Number of NIPALS iterations for each PC.

conv

Logical indicating whether the NIPALS converged before reaching the maximum number of iterations.
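As a quick check of the relations stated above (a sketch only, on simulated data; it assumes the components eig, sv, T and P listed in this section):

fm <- pcaeigen(matrix(rnorm(50 * 8), nrow = 50), nlv = 3)
all.equal(c(fm$eig)[1:3], c(fm$sv)[1:3]^2)   # eig = sv^2
dim(fm$T)                                    # (n, nlv)
dim(fm$P)                                    # (p, nlv)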

References

Andrecut, M., 2009. Parallel GPU Implementation of Iterative PCA Algorithms. Journal of Computational Biology 16, 1593-1599. https://doi.org/10.1089/cmb.2008.0221

Gabriel, R. K., 2002. Le biplot - Outil d'exploration de données multidimensionnelles. Journal de la Société Française de la Statistique, 143, 5-55.

Lingen, F.J., 2000. Efficient Gram-Schmidt orthonormalisation on parallel computers. Communications in Numerical Methods in Engineering 16, 57-66. https://doi.org/10.1002/(SICI)1099-0887(200001)16:1<57::AID-CNM320>3.0.CO;2-I

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.

Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/

Wu, W., Massart, D.L., de Jong, S., 1997. The kernel PCA algorithms for wide data. Part I: Theory and algorithms. Chemometrics and Intelligent Laboratory Systems 36, 165-172. https://doi.org/10.1016/S0169-7439(97)00010-5

For Spherical PCA:

Daszykowski, M., Kaczmarek, K., Vander Heyden, Y., Walczak, B., 2007. Robust statistics in data analysis - A review. Chemometrics and Intelligent Laboratory Systems 85, 203-219. https://doi.org/10.1016/j.chemolab.2006.06.016

Locantore, N., Marron, J.S., Simpson, D.G., Tripoli, N., Zhang, J.T., Cohen, K.L., 1999. Robust principal component analysis for functional data. Test 8, 1-7.

Maronna, R., 2005. Principal components and orthogonal regression based on robust scales. Technometrics 47, 264-273. https://doi.org/10.1198/004017005000000166

Examples


## Simulated training data; zX is a copy of Xtrain with some values set to NA
n <- 6 ; p <- 4
Xtrain <- matrix(rnorm(n * p), nrow = n)
s <- c(3, 4, 7, 10, 11, 15, 21:24)
zX <- replace(Xtrain, s, NA)
Xtrain
zX
## Simulated test data
m <- 2
Xtest <- matrix(rnorm(m * p), nrow = m)

## Fits with the different algorithms; pcanipalsna also accepts the matrix with NAs
pcasvd(Xtrain, nlv = 3)
pcaeigen(Xtrain, nlv = 3)
pcaeigenk(Xtrain, nlv = 3)
pcanipals(Xtrain, nlv = 3)
pcanipalsna(Xtrain, nlv = 3)
pcanipalsna(zX, nlv = 3)

## Fitted model, scores, and projection of new observations
fm <- pcaeigen(Xtrain, nlv = 3)
fm$T
transform(fm, Xtest)
transform(fm, Xtest, nlv = 2)

## Unweighted vs. weighted PCA scores
pcaeigen(Xtrain, nlv = 3)$T
pcaeigen(Xtrain, nlv = 3, weights = 1:n)$T


## Score plot of the training and test observations
Ttrain <- fm$T
Ttest <- transform(fm, Xtest)
T <- rbind(Ttrain, Ttest)
group <- c(rep("Training", nrow(Ttrain)), rep("Test", nrow(Ttest)))
i <- 1
plotxy(T[, i:(i+1)], group = group, pch = 16, zeroes = TRUE, cex = 1.3, main = "scores")

## Loadings plot
plotxy(fm$P, zeroes = TRUE, label = TRUE, cex = 2, col = "red3", main = "loadings")

## Summary and correlation circle
summary(fm, Xtrain)
res <- summary(fm, Xtrain)
plotxy(res$cor.circle, zeroes = TRUE, label = TRUE, cex = 2, col = "red3",
    circle = TRUE, ylim = c(-1, 1))


[Package rchemo version 0.1-1]