SPCAvRP {SPCAvRP} | R Documentation |
Computes the leading eigenvector using the SPCAvRP algorithm
Description
Computes the l-sparse leading eigenvector of the sample covariance matrix, using A x B random axis-aligned projections of dimension d. For multiple component estimation, use SPCAvRP_subspace or SPCAvRP_deflation.
Usage
SPCAvRP(data, cov = FALSE, l, d = 20, A = 600, B = 200,
center_data = TRUE, parallel = FALSE,
cluster_type = "PSOCK", cores = 1, machine_names = NULL)
Arguments
data |
Either the data matrix ( |
cov |
|
l |
Desired sparsity level in the final estimator (see Details). |
d |
The dimension of the random projections (see Details). |
A |
Number of projections over which to aggregate (see Details). |
B |
Number of projections in a group from which to select (see Details). |
center_data |
|
parallel |
|
cluster_type |
If |
cores |
If |
machine_names |
If |
Details
This function implements the SPCAvRP algorithm for principal component estimation (Algorithm 1 in the reference given below).
If the true sparsity level k is known, use l = k and d = k.
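To make the roles of l, d, A and B concrete, the following is a minimal base-R sketch of the general idea: draw A groups of B axis-aligned projections (random subsets of d coordinates), keep the best projection in each group, aggregate importance scores, and estimate on the l highest-scoring variables. It is a simplified illustration under stated assumptions, not the package's implementation: the helper name spca_rp_sketch is hypothetical, and the aggregation weights used in Algorithm 1 of the reference may differ from the plain average used here.

```r
# spca_rp_sketch() is a hypothetical helper, not part of the SPCAvRP package.
spca_rp_sketch <- function(X, l, d, A, B) {
  p <- ncol(X)
  Sigma_hat <- cov(X)                       # sample covariance matrix
  w <- numeric(p)                           # importance scores, one per variable
  for (a in seq_len(A)) {
    best_value <- -Inf
    best_idx <- NULL
    best_vec <- NULL
    for (b in seq_len(B)) {
      idx <- sample.int(p, d)               # axis-aligned projection: d random coordinates
      eig <- eigen(Sigma_hat[idx, idx], symmetric = TRUE)
      if (eig$values[1] > best_value) {     # keep the projection with the largest top eigenvalue
        best_value <- eig$values[1]
        best_idx <- idx
        best_vec <- eig$vectors[, 1]
      }
    }
    # Assumption: unweighted average of absolute loadings over the A groups.
    w[best_idx] <- w[best_idx] + abs(best_vec) / A
  }
  support <- order(w, decreasing = TRUE)[seq_len(l)]  # l highest-scoring variables
  v <- numeric(p)
  v[support] <- eigen(Sigma_hat[support, support], symmetric = TRUE)$vectors[, 1]
  list(vector = v, importance_scores = w)
}

set.seed(1)
p <- 30; k <- 3; n <- 500
v1 <- c(rep(1 / sqrt(k), k), rep(0, p - k))
# Data with population covariance 2 * v1 v1' + I, generated with base R only.
X <- matrix(rnorm(n * p), n, p) + sqrt(2) * rnorm(n) %o% v1
fit <- spca_rp_sketch(X, l = k, d = k, A = 50, B = 20)
```

The returned vector has at most l nonzero entries and unit Euclidean norm; the importance scores show which variables were selected most often and with the largest loadings.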
If the true sparsity level k is unknown, l can be an array of candidate values, in which case an estimator is computed for each of the corresponding sparsity levels. The user can then choose l by inspecting the explained variance returned in the output value, or by inspecting the output importance_scores. The default choice for d is 20, but we suggest choosing d equal to or slightly larger than l.
It is desirable to choose A (and B = ceiling(A/3)) as large as possible subject to the computational budget. In general, we suggest using A = 300 and B = 100 when the dimension of the data is a few hundred, and A = 600 and B = 200 when the dimension is on the order of 1000.
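The rule B = ceiling(A/3) can be checked directly in R. The helper suggest_AB below is hypothetical (it is not part of the package) and only reproduces the arithmetic stated above.

```r
# suggest_AB() is a hypothetical helper, not part of the SPCAvRP package.
# Given A, it returns the pair (A, B) with B = ceiling(A / 3).
suggest_AB <- function(A) list(A = A, B = ceiling(A / 3))

small <- suggest_AB(300)   # dimension of a few hundred
large <- suggest_AB(600)   # dimension on the order of 1000
small$B                    # 100
large$B                    # 200
```

Note that the rule is a suggestion rather than a constraint: the example at the end of this page uses A = 200 with B = 70, whereas ceiling(200/3) is 67.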
If center_data == TRUE and data is given as a data matrix, the first step is to center it by executing scale(data, center_data, FALSE), which subtracts the column means of data from their corresponding columns.
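As a small illustration of this centering step, the base-R snippet below applies scale() with center = TRUE and scale = FALSE to a toy matrix (the matrix itself is made up for the example) and confirms that the column means become numerically zero while the sample covariance is unchanged.

```r
set.seed(1)
# Toy data matrix: 5 observations of 4 variables, each with mean near 5.
data <- matrix(rnorm(20, mean = 5), nrow = 5, ncol = 4)

# Same call shape as scale(data, center_data, FALSE) with center_data = TRUE.
centered <- scale(data, TRUE, FALSE)

max_abs_mean <- max(abs(colMeans(centered)))   # numerically zero
removed_means <- attr(centered, "scaled:center") # the column means that were subtracted
```

Because the sample covariance is invariant to shifts of the columns, centering changes the data matrix but not the covariance matrix on which the estimator is based.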
If parallel == TRUE, the parallelised SPCAvRP algorithm is used. We recommend using this option if p, A and B are very large.
Value
Returns a list of three elements:
vector |
A matrix of dimension |
value |
An array with |
importance_scores |
An array of length p with importance scores for each variable 1 to p. |
Author(s)
Milana Gataric, Tengyao Wang and Richard J. Samworth
References
Milana Gataric, Tengyao Wang and Richard J. Samworth (2018). Sparse principal component analysis via random projections. https://arxiv.org/abs/1712.05630
Examples
p <- 100 # data dimension
k <- 10 # true sparsity level
n <- 1000 # number of observations
v1 <- c(rep(1/sqrt(k), k), rep(0,p-k)) # true principal component
Sigma <- 2*tcrossprod(v1) + diag(p) # population covariance
mu <- rep(0, p) # population mean
loss <- function(u, v) {
  # the loss function: sine of the angle between unit vectors u and v
  sqrt(abs(1 - sum(v * u)^2))
}
set.seed(1)
library(MASS) # provides mvrnorm
X <- mvrnorm(n, mu, Sigma) # data matrix
spcavrp <- SPCAvRP(data = X, cov = FALSE, l = k, d = k, A = 200, B = 70)
spcavrp.loss <- loss(v1,spcavrp$vector)
print(paste0("estimation loss when l=d=k=10, A=200, B=70: ", spcavrp.loss))
##choosing sparsity level l if k unknown:
#spcavrp.choosel <- SPCAvRP(data = X, cov = FALSE, l = c(1:30), d = 15, A = 200, B = 70)
#plot(1:p,spcavrp.choosel$importance_scores,xlab='variable',ylab='w',
# main='choosing l when k unknown: \n importance scores w')
#plot(1:30,spcavrp.choosel$value,xlab='l',ylab='Var_l',
# main='choosing l when k unknown: \n explained variance Var_l')
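The loss used in the example above equals the sine of the angle between two unit vectors, so it does not depend on the sign of the estimate. The short check below, which reuses the loss definition and the true component v1 from the example with a made-up orthogonal vector e_last, is a sketch of that behaviour; the abs() call guards against rounding error making 1 - sum(v*u)^2 slightly negative, which would otherwise produce NaN.

```r
loss <- function(u, v) sqrt(abs(1 - sum(v * u)^2))

k <- 10; p <- 100
v1 <- c(rep(1 / sqrt(k), k), rep(0, p - k))  # true principal component, as above
e_last <- c(rep(0, p - 1), 1)                # unit vector orthogonal to v1

loss_same <- loss(v1, v1)       # approximately 0
loss_flip <- loss(v1, -v1)      # approximately 0: the loss is sign-invariant
loss_orth <- loss(v1, e_last)   # 1: orthogonal directions
```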