R: Adaptive best subset selection for robust principal component...

abessrpca {abess}

R Documentation

Adaptive best subset selection for robust principal component analysis

Description

Decompose a matrix into the summation of low-rank matrix and sparse matrix via the best subset selection approach

Usage

abessrpca(
  x,
  rank,
  support.size = NULL,
  tune.path = c("sequence", "gsection"),
  gs.range = NULL,
  tune.type = c("gic", "aic", "bic", "ebic"),
  ic.scale = 1,
  lambda = 0,
  always.include = NULL,
  group.index = NULL,
  c.max = NULL,
  splicing.type = 2,
  max.splicing.iter = 1,
  warm.start = TRUE,
  important.search = NULL,
  max.newton.iter = 1,
  newton.thresh = 0.001,
  num.threads = 0,
  seed = 1,
  ...
)

Arguments

`x`	A matrix object.
`rank`	A positive integer value specify the rank of the low-rank matrix.
`support.size`	An integer vector representing the alternative support sizes. Only used for `tune.path = "sequence"`. Strongly suggest its minimum value larger than `min(dim(x))`.
`tune.path`	The method to be used to select the optimal support size. For `tune.path = "sequence"`, we solve the best subset selection problem for each size in `support.size`. For `tune.path = "gsection"`, we solve the best subset selection problem with support size ranged in `gs.range`, where the specific support size to be considered is determined by golden section.
`gs.range`	A integer vector with two elements. The first element is the minimum model size considered by golden-section, the later one is the maximum one. Default is `gs.range = c(1, min(n, round(n/(log(log(n))log(p)))))`.
`tune.type`	The type of criterion for choosing the support size. Available options are "gic", "ebic", "bic" and "aic". Default is "gic".
`ic.scale`	A non-negative value used for multiplying the penalty term in information criterion. Default: `ic.scale = 1`.
`lambda`	A single lambda value for regularized best subset selection. Default is 0.
`always.include`	An integer vector containing the indexes of variables that should always be included in the model.
`group.index`	A vector of integers indicating the which group each variable is in. For variables in the same group, they should be located in adjacent columns of `x` and their corresponding index in `group.index` should be the same. Denote the first group as `1`, the second `2`, etc. If you do not fit a model with a group structure, please set `group.index = NULL` (the default).
`c.max`	an integer splicing size. Default is: `c.max = 2`.
`splicing.type`	Optional type for splicing. If `splicing.type = 1`, the number of variables to be spliced is `c.max`, ..., `1`; if `splicing.type = 2`, the number of variables to be spliced is `c.max`, `c.max/2`, ..., `1`. (Default: `splicing.type = 2`.)
`max.splicing.iter`	The maximum number of performing splicing algorithm. In most of the case, only a few times of splicing iteration can guarantee the convergence. Default is `max.splicing.iter = 20`.
`warm.start`	Whether to use the last solution as a warm start. Default is `warm.start = TRUE`.
`important.search`	An integer number indicating the number of important variables to be splicing. When `important.search` `\ll` `p` variables, it would greatly reduce runtimes. Default: `important.search = 128`.
`max.newton.iter`	a integer giving the maximal number of Newton's iteration iterations. Default is `max.newton.iter = 10` if `newton = "exact"`, and `max.newton.iter = 60` if `newton = "approx"`.
`newton.thresh`	a numeric value for controlling positive convergence tolerance. The Newton's iterations converge when `\|dev - dev_{old}\|/(\|dev\| + 0.1)<` `newton.thresh`.
`num.threads`	An integer decide the number of threads to be concurrently used for cross-validation (i.e., `tune.type = "cv"`). If `num.threads = 0`, then all of available cores will be used. Default: `num.threads = 0`.
`seed`	Seed to be used to divide the sample into cross-validation folds. Default is `seed = 1`.
`...`	further arguments to be passed to or from methods.

Details

Adaptive best subset selection for robust principal component analysis aim to find two latent matrices L and S such that the original matrix X can be appropriately approximated:

x = L + S + N,

where L is a low-rank matrix, S is a sparse matrix, N is a dense noise matrix. Generic splicing technique can be employed to solve this problem by iteratively improve the quality of the estimation of S.

For a given support set \Omega, the optimization problem:

\min_S \| x - L - S\|_F^2 \;\;{\rm s.t.}\;\; S_{ij} = 0 {\rm for } (i, j) \in \Omega^c,

still a non-convex optimization problem. We use the hard-impute algorithm proposed in one of the reference to solve this problem. The hard-impute algorithm is an iterative algorithm, people can set max.newton.iter and newton.thresh to control the solution precision of the optimization problem. (Here, the name of the two parameters are somehow abused to make the parameters cross functions have an unified name.) According to our experiments, we assign properly parameters to the two parameter as the default such that the precision and runtime are well balanced, we suggest users keep the default values unchanged.

Value

A S3 abessrpca class object, which is a list with the following components:

`S`	A list with `length(support.size)` elements, each of which is a sparse matrix estimation;
`L`	The low rank matrix estimation.
`nobs`	The number of sample used for training.
`nvars`	The number of variables used for training.
`rank`	The rank of matrix `L`.
`loss`	The loss of objective function.
`tune.value`	A value of tuning criterion of length `length(support.size)`.
`support.size`	The actual support.size values used. Note that it is not necessary the same as the input if the later have non-integer values or duplicated values.
`tune.type`	The criterion type for tuning parameters.
`call`	The original call to `abessrpca`.

Note

Some parameters not described in the Details Section is explained in the document for abess because the meaning of these parameters are very similar.

At present, l_2 regularization and group selection are not support, and thus, set lambda and group.index have no influence on the output. This feature will coming soon.

References

A polynomial algorithm for best-subset selection problem. Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, Xueqin Wang. Proceedings of the National Academy of Sciences Dec 2020, 117 (52) 33117-33123; doi:10.1073/pnas.2014241117

Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. 2011. Robust principal component analysis? Journal of the ACM. 58, 3, Article 11 (May 2011), 37 pages. doi:10.1145/1970392.1970395

Mazumder, Rahul, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research 11 (2010): 2287-2322.

Examples


library(abess)
Sys.setenv("OMP_THREAD_LIMIT" = 2)
n <- 30
p <- 30
true_S_size <- 60
true_L_rank <- 2
dataset <- generate.matrix(n, p, support.size = true_S_size, rank = true_L_rank)
res <- abessrpca(dataset[["x"]], rank = true_L_rank, support.size = 50:70)
print(res)
coef(res)
plot(res, type = "tune")
plot(res, type = "loss")
plot(res, type = "S")

[Package abess version 0.4.8 Index]