R: Computes the leading eigenvector using the SPCAvRP algorithm

SPCAvRP {SPCAvRP}

R Documentation

Computes the leading eigenvector using the SPCAvRP algorithm

Description

Computes the l-sparse leading eigenvector of the sample covariance matrix, using A x B random axis-aligned projections of dimension d. For multiple-component estimation, use SPCAvRP_subspace or SPCAvRP_deflation.

Usage

SPCAvRP(data, cov = FALSE, l, d = 20, A = 600, B = 200, 
center_data = TRUE, parallel = FALSE, 
cluster_type = "PSOCK", cores = 1, machine_names = NULL)

Arguments

`data`	Either the data matrix (`n x p`, one observation per row) or the sample covariance matrix (`p x p`).
`cov`	`TRUE` if data is given as a sample covariance matrix.
`l`	Desired sparsity level in the final estimator (see Details).
`d`	The dimension of the random projections (see Details).
`A`	Number of projections over which to aggregate (see Details).
`B`	Number of projections in a group from which to select (see Details).
`center_data`	`TRUE` if the data matrix should be centered (see Details).
`parallel`	`TRUE` if the selection step should be computed in parallel using package `"parallel"`.
`cluster_type`	If `parallel == TRUE`, this can be `"PSOCK"` or `"FORK"` (cf. package `"parallel"`).
`cores`	If `parallel == TRUE` and `cluster_type == "FORK"`, number of cores to use.
`machine_names`	If `parallel == TRUE`, the names of the computers on the network.

Details

This function implements the SPCAvRP algorithm for principal component estimation (Algorithm 1 in the reference given below).
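
To make "axis-aligned projection of dimension d" concrete, the base-R sketch below keeps d randomly chosen coordinates, computes the leading eigenpair of the corresponding d x d block of the sample covariance, and embeds the eigenvector back into p dimensions. It then picks one projection out of a group of B. This is an illustration only, not the package's internal code: the helper `project_once`, the toy sizes, and the selection rule (largest leading eigenvalue within the group) are assumptions made here; see Algorithm 1 in the reference for the exact procedure, including how the A selected projections are aggregated.

```r
set.seed(1)
n <- 50; p <- 8; d <- 3; B <- 5            # toy sizes, chosen only for illustration
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
S <- cov(X)                                 # p x p sample covariance

# Hypothetical helper: one axis-aligned projection onto d random coordinates.
project_once <- function(S, d) {
  idx <- sort(sample.int(nrow(S), d))       # coordinates kept by the projection
  eig <- eigen(S[idx, idx, drop = FALSE], symmetric = TRUE)
  v <- numeric(nrow(S))
  v[idx] <- eig$vectors[, 1]                # embed back into R^p, zeros elsewhere
  list(idx = idx, value = eig$values[1], vector = v)
}

# Draw a group of B projections and keep the one with the largest leading
# eigenvalue (assumed selection rule, see the lead-in).
group <- lapply(seq_len(B), function(b) project_once(S, d))
best <- group[[which.max(vapply(group, function(g) g$value, numeric(1)))]]
```

The selected `best$vector` is a unit vector with at most d non-zero entries, which is why d should not be smaller than the target sparsity level.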

If the true sparsity level k is known, use l = k and d = k.

If the true sparsity level k is unknown, l can be an array of different values, in which case an estimator is computed for each sparsity level. The user can then make the final choice of l by inspecting the explained variance returned in the output value, or by inspecting the output importance_scores. The default choice for d is 20, but we suggest choosing d equal to or slightly larger than l.

It is desirable to choose A (and B = ceiling(A/3)) as large as possible subject to the computational budget. In general, we suggest using A = 300 and B = 100 when the dimension of the data is a few hundred, and A = 600 and B = 200 when the dimension is on the order of 1000.
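
As a small worked example of the B = ceiling(A/3) rule of thumb, the snippet below derives B from a chosen A. The helper name `choose_B` is hypothetical and not part of the package.

```r
# Hypothetical helper: derive B from A using the B = ceiling(A/3) rule of thumb.
choose_B <- function(A) ceiling(A / 3)

A_values <- c(300, 600)        # the two settings suggested above
B_values <- choose_B(A_values) # gives 100 and 200
```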

If center_data == TRUE and data is given as a data matrix, the first step is to center it by executing scale(data, center_data, FALSE), which subtracts the column means of data from their corresponding columns.
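
The centering step can be checked in base R on a toy matrix. The sketch below uses made-up data and only illustrates what `scale(data, TRUE, FALSE)` does: it removes the column means and leaves the sample covariance unchanged.

```r
set.seed(1)
X <- matrix(rnorm(20, mean = 5), nrow = 5, ncol = 4)  # toy n x p data, n = 5, p = 4
Xc <- scale(X, center = TRUE, scale = FALSE)          # subtract column means only

col_means_after <- colMeans(Xc)                        # numerically zero
# Centering does not change the sample covariance matrix.
same_cov <- isTRUE(all.equal(cov(X), cov(Xc), check.attributes = FALSE))
```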

If parallel == TRUE, the parallelised SPCAvRP algorithm is used. We recommend using this option when p, A and B are very large.

Value

Returns a list of three elements:

`vector`	A matrix of dimension `p x length(l)` whose columns are the estimated eigenvectors, one for each sparsity level in `l`.
`value`	An array with `length(l)` eigenvalues corresponding to the estimated eigenvectors returned in `vector`.
`importance_scores`	An array of length p with importance scores for each variable 1 to p.
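
The sketch below shows one way the returned list might be inspected. The object `out` is a made-up stand-in with the documented structure (p = 6, a single sparsity level l = 2); its numbers are invented for illustration and are not output of the package.

```r
# Made-up stand-in for the returned list, with p = 6 and l = 2.
out <- list(
  vector = matrix(c(0.8, 0.6, 0, 0, 0, 0), ncol = 1),
  value = 2.5,
  importance_scores = c(0.9, 0.7, 0.05, 0.02, 0.01, 0.03)
)

support <- which(out$vector[, 1] != 0)                     # variables with non-zero loading
ranked <- order(out$importance_scores, decreasing = TRUE)  # variables sorted by importance
```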

Author(s)

Milana Gataric, Tengyao Wang and Richard J. Samworth

References

Milana Gataric, Tengyao Wang and Richard J. Samworth (2018) Sparse principal component analysis via random projections https://arxiv.org/abs/1712.05630

Examples

p <- 100  # data dimension
k <- 10   # true sparsity level
n <- 1000 # number of observations
v1 <- c(rep(1/sqrt(k), k), rep(0,p-k)) # true principal component
Sigma <- 2*tcrossprod(v1) + diag(p)    # population covariance
mu <- rep(0, p)                        # population mean
loss <- function(u, v) {
  # the loss function
  sqrt(abs(1 - sum(v * u)^2))
}
set.seed(1)
X <- MASS::mvrnorm(n, mu, Sigma) # data matrix (n x p); mvrnorm comes from package MASS

spcavrp <- SPCAvRP(data = X, cov = FALSE, l = k, d = k, A = 200, B = 70)
spcavrp.loss <- loss(v1,spcavrp$vector)
print(paste0("estimation loss when l=d=k=10, A=200, B=70: ", spcavrp.loss))

##choosing sparsity level l if k unknown:
#spcavrp.choosel <- SPCAvRP(data = X, cov = FALSE, l = c(1:30), d = 15, A = 200, B = 70)
#plot(1:p,spcavrp.choosel$importance_scores,xlab='variable',ylab='w',
#     main='choosing l when k unknown: \n importance scores w')
#plot(1:30,spcavrp.choosel$value,xlab='l',ylab='Var_l',
#     main='choosing l when k unknown: \n explained variance Var_l')

[Package SPCAvRP version 0.4 Index]