abessrpca {abess} R Documentation

## Adaptive best subset selection for robust principal component analysis

### Description

Decompose a matrix into the summation of low-rank matrix and sparse matrix via the best subset selection approach

### Usage

abessrpca(
x,
rank,
support.size = NULL,
tune.path = c("sequence", "gsection"),
gs.range = NULL,
tune.type = c("gic", "aic", "bic", "ebic"),
ic.scale = 1,
lambda = 0,
always.include = NULL,
group.index = NULL,
c.max = NULL,
splicing.type = 2,
max.splicing.iter = 1,
warm.start = TRUE,
important.search = NULL,
max.newton.iter = 1,
newton.thresh = 0.001,
seed = 1,
...
)


### Arguments

 x A matrix object. rank A positive integer value specify the rank of the low-rank matrix. support.size An integer vector representing the alternative support sizes. Only used for tune.path = "sequence". Strongly suggest its minimum value larger than min(dim(x)). tune.path The method to be used to select the optimal support size. For tune.path = "sequence", we solve the best subset selection problem for each size in support.size. For tune.path = "gsection", we solve the best subset selection problem with support size ranged in gs.range, where the specific support size to be considered is determined by golden section. gs.range A integer vector with two elements. The first element is the minimum model size considered by golden-section, the later one is the maximum one. Default is gs.range = c(1, min(n, round(n/(log(log(n))log(p))))). tune.type The type of criterion for choosing the support size. Available options are "gic", "ebic", "bic" and "aic". Default is "gic". ic.scale A non-negative value used for multiplying the penalty term in information criterion. Default: ic.scale = 1. lambda A single lambda value for regularized best subset selection. Default is 0. always.include An integer vector containing the indexes of variables that should always be included in the model. group.index A vector of integers indicating the which group each variable is in. For variables in the same group, they should be located in adjacent columns of x and their corresponding index in group.index should be the same. Denote the first group as 1, the second 2, etc. If you do not fit a model with a group structure, please set group.index = NULL (the default). c.max an integer splicing size. Default is: c.max = 2. splicing.type Optional type for splicing. If splicing.type = 1, the number of variables to be spliced is c.max, ..., 1; if splicing.type = 2, the number of variables to be spliced is c.max, c.max/2, ..., 1. (Default: splicing.type = 2.) max.splicing.iter The maximum number of performing splicing algorithm. In most of the case, only a few times of splicing iteration can guarantee the convergence. Default is max.splicing.iter = 20. warm.start Whether to use the last solution as a warm start. Default is warm.start = TRUE. important.search An integer number indicating the number of important variables to be splicing. When important.search \ll p variables, it would greatly reduce runtimes. Default: important.search = 128. max.newton.iter a integer giving the maximal number of Newton's iteration iterations. Default is max.newton.iter = 10 if newton = "exact", and max.newton.iter = 60 if newton = "approx". newton.thresh a numeric value for controlling positive convergence tolerance. The Newton's iterations converge when |dev - dev_{old}|/(|dev| + 0.1)< newton.thresh. num.threads An integer decide the number of threads to be concurrently used for cross-validation (i.e., tune.type = "cv"). If num.threads = 0, then all of available cores will be used. Default: num.threads = 0. seed Seed to be used to divide the sample into cross-validation folds. Default is seed = 1. ... further arguments to be passed to or from methods.

### Details

Adaptive best subset selection for robust principal component analysis aim to find two latent matrices L and S such that the original matrix X can be appropriately approximated:

x = L + S + N,

where L is a low-rank matrix, S is a sparse matrix, N is a dense noise matrix. Generic splicing technique can be employed to solve this problem by iteratively improve the quality of the estimation of S.

For a given support set \Omega, the optimization problem:

\min_S \| x - L - S\|_F^2 \;\;{\rm s.t.}\;\; S_{ij} = 0 {\rm for } (i, j) \in \Omega^c,

still a non-convex optimization problem. We use the hard-impute algorithm proposed in one of the reference to solve this problem. The hard-impute algorithm is an iterative algorithm, people can set max.newton.iter and newton.thresh to control the solution precision of the optimization problem. (Here, the name of the two parameters are somehow abused to make the parameters cross functions have an unified name.) According to our experiments, we assign properly parameters to the two parameter as the default such that the precision and runtime are well balanced, we suggest users keep the default values unchanged.

### Value

A S3 abessrpca class object, which is a list with the following components:

 S A list with length(support.size) elements, each of which is a sparse matrix estimation; L The low rank matrix estimation. nobs The number of sample used for training. nvars The number of variables used for training. rank The rank of matrix L. loss The loss of objective function. tune.value A value of tuning criterion of length length(support.size). support.size The actual support.size values used. Note that it is not necessary the same as the input if the later have non-integer values or duplicated values. tune.type The criterion type for tuning parameters. call The original call to abessrpca.

### Note

Some parameters not described in the Details Section is explained in the document for abess because the meaning of these parameters are very similar.

At present, l_2 regularization and group selection are not support, and thus, set lambda and group.index have no influence on the output. This feature will coming soon.

### References

A polynomial algorithm for best-subset selection problem. Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, Xueqin Wang. Proceedings of the National Academy of Sciences Dec 2020, 117 (52) 33117-33123; doi: 10.1073/pnas.2014241117

Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. 2011. Robust principal component analysis? Journal of the ACM. 58, 3, Article 11 (May 2011), 37 pages. doi: 10.1145/1970392.1970395

Mazumder, Rahul, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research 11 (2010): 2287-2322.

### Examples


library(abess)
n <- 30
p <- 30
true_S_size <- 60
true_L_rank <- 2
dataset <- generate.matrix(n, p, support.size = true_S_size, rank = true_L_rank)
res <- abessrpca(dataset[["x"]], rank = true_L_rank, support.size = 50:70)
print(res)
coef(res)
plot(res, type = "tune")
plot(res, type = "loss")
plot(res, type = "S")



[Package abess version 0.4.6 Index]