abessrpca {abess}    R Documentation

Adaptive best subset selection for robust principal component analysis

Description

Decompose a matrix into the sum of a low-rank matrix and a sparse matrix via the best subset selection approach.

Usage

abessrpca(
  x,
  rank,
  support.size = NULL,
  tune.path = c("sequence", "gsection"),
  gs.range = NULL,
  tune.type = c("gic", "aic", "bic", "ebic"),
  ic.scale = 1,
  lambda = 0,
  always.include = NULL,
  group.index = NULL,
  c.max = NULL,
  splicing.type = 2,
  max.splicing.iter = 1,
  warm.start = TRUE,
  important.search = NULL,
  max.newton.iter = 1,
  newton.thresh = 0.001,
  num.threads = 0,
  seed = 1,
  ...
)

Arguments

x

A matrix object.

rank

A positive integer value specifying the rank of the low-rank matrix.

support.size

An integer vector representing the alternative support sizes. Only used for tune.path = "sequence". It is strongly suggested that its minimum value be larger than min(dim(x)).

tune.path

The method used to select the optimal support size. For tune.path = "sequence", the best subset selection problem is solved for each size in support.size. For tune.path = "gsection", the best subset selection problem is solved with support sizes ranging over gs.range, where the specific sizes to be considered are determined by a golden-section search (see the sketch under gs.range below).

gs.range

An integer vector with two elements: the first is the minimum model size considered by the golden-section search, and the second is the maximum. Default is gs.range = c(1, min(n, round(n / (log(log(n)) * log(p))))).
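
For instance, a minimal sketch of golden-section tuning, using this package's generate.matrix helper; the range endpoints 40 and 80 are arbitrary illustrative choices, not defaults:

library(abess)
dat <- generate.matrix(30, 30, support.size = 60, rank = 2)
## search support sizes between 40 and 80 by golden section
fit <- abessrpca(dat[["x"]], rank = 2,
                 tune.path = "gsection", gs.range = c(40, 80))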

tune.type

The type of criterion for choosing the support size. Available options are "gic", "ebic", "bic" and "aic". Default is "gic".

ic.scale

A non-negative value used for multiplying the penalty term in information criterion. Default: ic.scale = 1.

lambda

A single lambda value for regularized best subset selection. Default is 0.

always.include

An integer vector containing the indexes of variables that should always be included in the model.

group.index

A vector of integers indicating which group each variable belongs to. Variables in the same group should be located in adjacent columns of x, and their corresponding entries in group.index should be equal. Denote the first group as 1, the second as 2, etc. If you do not fit a model with a group structure, set group.index = NULL (the default).

c.max

An integer specifying the splicing size. Default: c.max = 2.

splicing.type

Optional type for splicing. If splicing.type = 1, the number of variables to be spliced is c.max, ..., 1; if splicing.type = 2, it is c.max, c.max/2, ..., 1 (e.g., for c.max = 8, the schedule is 8, 4, 2, 1). Default: splicing.type = 2.

max.splicing.iter

The maximum number of splicing iterations. In most cases, a few splicing iterations are enough to guarantee convergence. Default is max.splicing.iter = 20.

warm.start

Whether to use the last solution as a warm start. Default is warm.start = TRUE.

important.search

An integer indicating the number of important variables to be spliced. When important.search is much smaller than p, the runtime is greatly reduced. Default: important.search = 128.

max.newton.iter

An integer giving the maximal number of Newton iterations. Default is max.newton.iter = 10 if newton = "exact", and max.newton.iter = 60 if newton = "approx".

newton.thresh

A numeric value controlling the positive convergence tolerance. The Newton iterations converge when |dev - dev_old| / (|dev| + 0.1) < newton.thresh.
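
For example, with hypothetical deviance values the stopping rule evaluates as:

dev_old <- 10
dev <- 9.999
abs(dev - dev_old) / (abs(dev) + 0.1) < 0.001  ## TRUE, so the iterations stop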

num.threads

An integer deciding the number of threads to be used concurrently for cross-validation (i.e., tune.type = "cv"). If num.threads = 0, all available cores are used. Default: num.threads = 0.

seed

Seed to be used to divide the sample into cross-validation folds. Default is seed = 1.

...

Further arguments to be passed to or from methods.

Details

Adaptive best subset selection for robust principal component analysis aims to find two latent matrices L and S such that the original matrix x can be appropriately approximated:

x = L + S + N,

where L is a low-rank matrix, S is a sparse matrix, and N is a dense noise matrix. The generic splicing technique can be employed to solve this problem by iteratively improving the estimate of S.
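
For intuition, such a matrix can be constructed directly in base R; the sizes and magnitudes below are illustrative only:

set.seed(1)
n <- 30; p <- 30; r <- 2
L <- matrix(rnorm(n * r), n, r) %*% matrix(rnorm(r * p), r, p)  ## rank-r component
S <- matrix(0, n, p)
S[sample(n * p, 60)] <- rnorm(60, sd = 5)  ## sparse component: 60 outlying entries
N <- matrix(rnorm(n * p, sd = 0.1), n, p)  ## dense noise
x <- L + S + N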

For a given support set \Omega, the optimization problem:

\min_{S} \|x - L - S\|_F^2 \;\; \mathrm{s.t.} \;\; S_{ij} = 0 \ \mathrm{for}\ (i, j) \in \Omega^c,

is still a non-convex optimization problem. We use the hard-impute algorithm proposed in one of the references to solve it. Hard-impute is an iterative algorithm; users can set max.newton.iter and newton.thresh to control the solution precision of the optimization problem. (Here, the names of these two parameters are reused so that parameters have unified names across functions.) According to our experiments, the default values balance precision and runtime well, so we suggest keeping them unchanged.
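
To make the rank-constrained step concrete, here is a hedged sketch of one hard-impute-style update of L given the current S: take the best rank-r approximation of the residual via a truncated SVD. This mimics the idea only; it is not the package's internal routine:

## one update: L <- best rank-r approximation of (x - S)
hard_impute_step <- function(x, S, r) {
  sv <- svd(x - S, nu = r, nv = r)
  sv$u %*% diag(sv$d[1:r], r, r) %*% t(sv$v)
}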

Value

An S3 abessrpca class object, which is a list with the following components:

S

A list with length(support.size) elements, each of which is a sparse matrix estimate;

L

The low-rank matrix estimate.

nobs

The number of samples (rows) used for training.

nvars

The number of variables (columns) used for training.

rank

The rank of matrix L.

loss

The value of the loss (objective) function.

tune.value

A vector of tuning criterion values with length length(support.size).

support.size

The actual support.size values used. Note that it is not necessarily the same as the input if the latter has non-integer or duplicated values.

tune.type

The criterion type for tuning parameters.

call

The original call to abessrpca.
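
Assuming a fitted object named fit returned by a prior abessrpca() call (the name fit is hypothetical), a typical way to extract the estimate at the size minimizing the tuning criterion is:

best <- which.min(fit[["tune.value"]])
S_hat <- fit[["S"]][[best]]         ## sparse estimate at the selected support size
L_hat <- fit[["L"]]                 ## low-rank estimate
fit[["support.size"]][best]         ## the selected support size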

Note

Some parameters not described in the Details section are explained in the documentation for abess because their meanings are very similar.

At present, l_2 regularization and group selection are not supported; thus, setting lambda and group.index has no influence on the output. These features are coming soon.

References

A polynomial algorithm for best-subset selection problem. Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, Xueqin Wang. Proceedings of the National Academy of Sciences Dec 2020, 117 (52) 33117-33123; doi:10.1073/pnas.2014241117

Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. 2011. Robust principal component analysis? Journal of the ACM. 58, 3, Article 11 (May 2011), 37 pages. doi:10.1145/1970392.1970395

Mazumder, Rahul, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research 11 (2010): 2287-2322.

Examples


library(abess)
Sys.setenv("OMP_THREAD_LIMIT" = 2)

## simulate a 30 x 30 matrix: a rank-2 low-rank part, a sparse part with
## 60 non-zero entries, and dense noise
n <- 30
p <- 30
true_S_size <- 60
true_L_rank <- 2
dataset <- generate.matrix(n, p, support.size = true_S_size, rank = true_L_rank)

## fit over a sequence of candidate support sizes around the truth
res <- abessrpca(dataset[["x"]], rank = true_L_rank, support.size = 50:70)
print(res)
coef(res)

## diagnostic plots: tuning criterion, loss, and the estimated sparse matrix
plot(res, type = "tune")
plot(res, type = "loss")
plot(res, type = "S")


[Package abess version 0.4.8 Index]