abess.default {abess}R Documentation

Adaptive Best-Subset Selection via Splicing

Description

Adaptive best-subset selection for regression, (multi-class) classification, counting-response, censored-response, multi-response modeling in polynomial times.

Usage

## Default S3 method:
abess(
  x,
  y,
  family = c("gaussian", "binomial", "poisson", "cox", "mgaussian", "multinomial"),
  tune.path = c("sequence", "gsection"),
  tune.type = c("gic", "ebic", "bic", "aic", "cv"),
  weight = NULL,
  normalize = NULL,
  c.max = 2,
  support.size = NULL,
  gs.range = NULL,
  lambda = 0,
  always.include = NULL,
  group.index = NULL,
  splicing.type = 2,
  max.splicing.iter = 20,
  screening.num = NULL,
  important.search = NULL,
  warm.start = TRUE,
  nfolds = 5,
  cov.update = FALSE,
  newton = c("exact", "approx"),
  newton.thresh = 1e-06,
  max.newton.iter = NULL,
  early.stop = FALSE,
  num.threads = 0,
  seed = 1,
  ...
)

## S3 method for class 'formula'
abess(formula, data, subset, na.action, ...)

Arguments

x

Input matrix, of dimension n \times p; each row is an observation vector and each column is a predictor/feature/variable. Can be in sparse matrix format (inherit from class "dgCMatrix" in package Matrix).

y

The response variable, of n observations. For family = "binomial" should have two levels. For family="poisson", y should be a vector with positive integer. For family = "cox", y should be a Surv object returned by the survival package (recommended) or a two-column matrix with columns named "time" and "status". For family = "mgaussian", y should be a matrix of quantitative responses. For family = "multinomial", y should be a factor of at least three levels. Note that, for either "binomial" or "multinomial", if y is presented as a numerical vector, it will be coerced into a factor.

family

One of the following models: "gaussian" (continuous response), "binomial" (binary response), "poisson" (non-negative count), "cox" (left-censored response), "mgaussian" (multivariate continuous response). Depending on the response. Any unambiguous substring can be given.

tune.path

The method to be used to select the optimal support size. For tune.path = "sequence", we solve the best subset selection problem for each size in support.size. For tune.path = "gsection", we solve the best subset selection problem with support size ranged in gs.range, where the specific support size to be considered is determined by golden section.

tune.type

The type of criterion for choosing the support size. Available options are "gic", "ebic", "bic", "aic" and "cv". Default is "gic".

weight

Observation weights. When weight = NULL, we set weight = 1 for each observation as default.

normalize

Options for normalization. normalize = 0 for no normalization. normalize = 1 for subtracting the mean of columns of x. normalize = 2 for scaling the columns of x to have √ n norm. normalize = 3 for subtracting the means of the columns of x and y, and also normalizing the columns of x to have √ n norm. If normalize = NULL, normalize will be set 1 for "gaussian", 2 for "binomial". Default is normalize = NULL.

c.max

an integer splicing size. Default is: c.max = 2.

support.size

An integer vector representing the alternative support sizes. Only used for tune.path = "sequence". Default is 0:min(n, round(n/(log(log(n))log(p)))).

gs.range

A integer vector with two elements. The first element is the minimum model size considered by golden-section, the later one is the maximum one. Default is gs.range = c(1, min(n, round(n/(log(log(n))log(p))))). Not available now.

lambda

A single lambda value for regularized best subset selection. Default is 0.

always.include

An integer vector containing the indexes of variables that should always be included in the model.

group.index

A vector of integers indicating the which group each variable is in. For variables in the same group, they should be located in adjacent columns of x and their corresponding index in group.index should be the same. Denote the first group as 1, the second 2, etc. If you do not fit a model with a group structure, please set group.index = NULL (the default).

splicing.type

Optional type for splicing. If splicing.type = 1, the number of variables to be spliced is c.max, ..., 1; if splicing.type = 2, the number of variables to be spliced is c.max, c.max/2, ..., 1. (Default: splicing.type = 2.)

max.splicing.iter

The maximum number of performing splicing algorithm. In most of the case, only a few times of splicing iteration can guarantee the convergence. Default is max.splicing.iter = 20.

screening.num

An integer number. Preserve screening.num number of predictors with the largest marginal maximum likelihood estimator before running algorithm.

important.search

An integer number indicating the number of important variables to be splicing. When important.search \ll p variables, it would greatly reduce runtimes. Default: important.search = 128.

warm.start

Whether to use the last solution as a warm start. Default is warm.start = TRUE.

nfolds

The number of folds in cross-validation. Default is nfolds = 5.

cov.update

A logical value only used for family = "gaussian". If cov.update = TRUE, use a covariance-based implementation; otherwise, a naive implementation. The naive method is more efficient than covariance-based method when p >> n and important.search is much large than its default value. Default: cov.update = FALSE.

newton

A character specify the Newton's method for fitting generalized linear models, it should be either newton = "exact" or newton = "approx". If newton = "exact", then the exact hessian is used, while newton = "approx" uses diagonal entry of the hessian, and can be faster (especially when family = "cox").

newton.thresh

a numeric value for controlling positive convergence tolerance. The Newton's iterations converge when |dev - dev_{old}|/(|dev| + 0.1)< newton.thresh.

max.newton.iter

a integer giving the maximal number of Newton's iteration iterations. Default is max.newton.iter = 10 if newton = "exact", and max.newton.iter = 60 if newton = "approx".

early.stop

A boolean value decide whether early stopping. If early.stop = TRUE, algorithm will stop if the last tuning value less than the existing one. Default: early.stop = FALSE.

num.threads

An integer decide the number of threads to be concurrently used for cross-validation (i.e., tune.type = "cv"). If num.threads = 0, then all of available cores will be used. Default: num.threads = 0.

seed

Seed to be used to divide the sample into cross-validation folds. Default is seed = 1.

...

further arguments to be passed to or from methods.

formula

an object of class "formula": a symbolic description of the model to be fitted. The details of model specification are given in the "Details" section of "formula".

data

a data frame containing the variables in the formula.

subset

an optional vector specifying a subset of observations to be used.

na.action

a function which indicates what should happen when the data contain NAs. Defaults to getOption("na.action").

Details

Best-subset selection aims to find a small subset of predictors, so that the resulting model is expected to have the most desirable prediction accuracy. Best-subset selection problem under the support size s is

\min_β -2 \log L(β) \;\;{\rm s.t.}\;\; \|β\|_0 ≤q s,

where L(β) is arbitrary convex functions. In the GLM case, \log L(β) is the log-likelihood function; in the Cox model, \log L(β) is the log partial-likelihood function.

The best subset selection problem is solved by the "abess" algorithm in this package, see Zhu (2020) for details. Under mild conditions, the algorithm exactly solve this problem in polynomial time. This algorithm exploits the idea of sequencing and splicing to reach a stable solution in finite steps when s is fixed. To find the optimal support size s, we provide various criterion like GIC, AIC, BIC and cross-validation error to determine it.

Value

A S3 abess class object, which is a list with the following components:

beta

A p-by-length(support.size) matrix of coefficients for univariate family, stored in column format; while a list of length(support.size) coefficients matrix (with size p-by-ncol(y)) for multivariate family.

intercept

An intercept vector of length length(support.size) for univariate family; while a list of length(support.size) intercept vector (with size ncol(y)) for multivariate family.

dev

the deviance of length length(support.size).

tune.value

A value of tuning criterion of length length(support.size).

nobs

The number of sample used for training.

nvars

The number of variables used for training.

family

Type of the model.

tune.path

The path type for tuning parameters.

support.size

The actual support.size values used. Note that it is not necessary the same as the input if the later have non-integer values or duplicated values.

edf

The effective degree of freedom. It is the same as support.size when lambda = 0.

best.size

The best support size selected by the tuning value.

tune.type

The criterion type for tuning parameters.

tune.path

The strategy for tuning parameters.

screening.vars

The character vector specify the feature selected by feature screening. It would be an empty character vector if screening.num = 0.

call

The original call to abess.

Author(s)

Jin Zhu, Junxian Zhu, Canhong Wen, Heping Zhang, Xueqin Wang

References

A polynomial algorithm for best-subset selection problem. Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, Xueqin Wang. Proceedings of the National Academy of Sciences Dec 2020, 117 (52) 33117-33123; DOI: 10.1073/pnas.2014241117

Sure independence screening for ultrahigh dimensional feature space. Fan, J. and Lv, J. (2008), Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70: 849-911. https://doi.org/10.1111/j.1467-9868.2008.00674.x

Targeted Inference Involving High-Dimensional Data Using Nuisance Penalized Regression. Qiang Sun & Heping Zhang (2020). Journal of the American Statistical Association, DOI: 10.1080/01621459.2020.1737079

Certifiably Polynomial Algorithm for Best Group Subset Selection. Zhang, Yanhang, Junxian Zhu, Jin Zhu, and Xueqin Wang (2021). arXiv preprint arXiv:2104.12576.

See Also

print.abess, predict.abess, coef.abess, extract.abess, plot.abess, deviance.abess.

Examples


library(abess)
n <- 100
p <- 20
support.size <- 3

################ linear model ################
dataset <- generate.data(n, p, support.size)
abess_fit <- abess(dataset[["x"]], dataset[["y"]])
## helpful generic functions:
print(abess_fit)
coef(abess_fit, support.size = 3)
predict(abess_fit,
  newx = dataset[["x"]][1:10, ],
  support.size = c(3, 4)
)
str(extract(abess_fit, 3))
deviance(abess_fit)
plot(abess_fit)

################ logistic model ################
dataset <- generate.data(n, p, support.size, family = "binomial")
## allow cross-validation to tuning
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
  family = "binomial", tune.type = "cv"
)
abess_fit

################ poisson model ################
dataset <- generate.data(n, p, support.size, family = "poisson")
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
  family = "poisson", tune.type = "cv"
)
abess_fit

################ Cox model ################
dataset <- generate.data(n, p, support.size, family = "cox")
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
  family = "cox", tune.type = "cv"
)

################ Multivariate gaussian model ################
dataset <- generate.data(n, p, support.size, family = "mgaussian")
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
  family = "mgaussian", tune.type = "cv"
)
plot(abess_fit, type = "l2norm")

################ Multinomial model (multi-classification) ################
dataset <- generate.data(n, p, support.size, family = "multinomial")
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
  family = "multinomial", tune.type = "cv"
)
predict(abess_fit,
  newx = dataset[["x"]][1:10, ],
  support.size = c(3, 4), type = "response"
)

########## Best group subset selection #############
dataset <- generate.data(n, p, support.size)
group_index <- rep(1:10, each = 2)
abess_fit <- abess(dataset[["x"]], dataset[["y"]], group.index = group_index)
str(extract(abess_fit))

################ Golden section searching ################
dataset <- generate.data(n, p, support.size)
abess_fit <- abess(dataset[["x"]], dataset[["y"]], tune.path = "gsection")
abess_fit

################ Feature screening ################
p <- 1000
dataset <- generate.data(n, p, support.size)
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
  screening.num = 100
)
str(extract(abess_fit))

################ Sparse predictor ################
require(Matrix)
p <- 1000
dataset <- generate.data(n, p, support.size)
dataset[["x"]][abs(dataset[["x"]]) < 1] <- 0
dataset[["x"]] <- Matrix(dataset[["x"]])
abess_fit <- abess(dataset[["x"]], dataset[["y"]])
str(extract(abess_fit))


################  Formula interface  ################
data("trim32")
abess_fit <- abess(y ~ ., data = trim32)
abess_fit


[Package abess version 0.3.0 Index]