abess.default {abess} | R Documentation |
Adaptive best subset selection (for generalized linear model)
Description
Adaptive best-subset selection for regression, (multi-class) classification, counting-response, censored-response, positive response, multi-response modeling in polynomial times.
Usage
## Default S3 method:
abess(
x,
y,
family = c("gaussian", "binomial", "poisson", "cox", "mgaussian", "multinomial",
"gamma", "ordinal"),
tune.path = c("sequence", "gsection"),
tune.type = c("gic", "ebic", "bic", "aic", "cv"),
weight = NULL,
normalize = NULL,
fit.intercept = TRUE,
beta.low = -.Machine$double.xmax,
beta.high = .Machine$double.xmax,
c.max = 2,
support.size = NULL,
gs.range = NULL,
lambda = 0,
always.include = NULL,
group.index = NULL,
init.active.set = NULL,
splicing.type = 2,
max.splicing.iter = 20,
screening.num = NULL,
important.search = NULL,
warm.start = TRUE,
nfolds = 5,
foldid = NULL,
cov.update = FALSE,
newton = c("exact", "approx"),
newton.thresh = 1e-06,
max.newton.iter = NULL,
early.stop = FALSE,
ic.scale = 1,
num.threads = 0,
seed = 1,
...
)
## S3 method for class 'formula'
abess(formula, data, subset, na.action, ...)
Arguments
x |
Input matrix, of dimension |
y |
The response variable, of |
family |
One of the following models:
|
tune.path |
The method to be used to select the optimal support size. For
|
tune.type |
The type of criterion for choosing the support size.
Available options are |
weight |
Observation weights. When |
normalize |
Options for normalization.
|
fit.intercept |
A boolean value indicating whether to fit an intercept.
We assume the data has been centered if |
beta.low |
A single value specifying the lower bound of |
beta.high |
A single value specifying the upper bound of |
c.max |
an integer splicing size. Default is: |
support.size |
An integer vector representing the alternative support sizes.
Only used for |
gs.range |
A integer vector with two elements.
The first element is the minimum model size considered by golden-section,
the later one is the maximum one. Default is |
lambda |
A single lambda value for regularized best subset selection. Default is 0. |
always.include |
An integer vector containing the indexes of variables that should always be included in the model. |
group.index |
A vector of integers indicating the which group each variable is in.
For variables in the same group, they should be located in adjacent columns of |
init.active.set |
A vector of integers indicating the initial active set.
Default: |
splicing.type |
Optional type for splicing.
If |
max.splicing.iter |
The maximum number of performing splicing algorithm.
In most of the case, only a few times of splicing iteration can guarantee the convergence.
Default is |
screening.num |
An integer number. Preserve |
important.search |
An integer number indicating the number of
important variables to be splicing.
When |
warm.start |
Whether to use the last solution as a warm start. Default is |
nfolds |
The number of folds in cross-validation. Default is |
foldid |
an optional integer vector of values between 1, ..., nfolds identifying what fold each observation is in.
The default |
cov.update |
A logical value only used for |
newton |
A character specify the Newton's method for fitting generalized linear models,
it should be either |
newton.thresh |
a numeric value for controlling positive convergence tolerance.
The Newton's iterations converge when |
max.newton.iter |
a integer giving the maximal number of Newton's iteration iterations.
Default is |
early.stop |
A boolean value decide whether early stopping.
If |
ic.scale |
A non-negative value used for multiplying the penalty term
in information criterion. Default: |
num.threads |
An integer decide the number of threads to be
concurrently used for cross-validation (i.e., |
seed |
Seed to be used to divide the sample into cross-validation folds.
Default is |
... |
further arguments to be passed to or from methods. |
formula |
an object of class " |
data |
a data frame containing the variables in the |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates
what should happen when the data contain |
Details
Best-subset selection aims to find a small subset of predictors,
so that the resulting model is expected to have the most desirable prediction accuracy.
Best-subset selection problem under the support size s
is
\min_\beta -2 \log L(\beta) \;\;{\rm s.t.}\;\; \|\beta\|_0 \leq s,
where L(\beta)
is arbitrary convex functions. In
the GLM case, \log L(\beta)
is the log-likelihood function; in the Cox
model, \log L(\beta)
is the log partial-likelihood function.
The best subset selection problem is solved by the splicing algorithm in this package, see Zhu (2020) for details.
Under mild conditions, the algorithm exactly solve this problem in polynomial time.
This algorithm exploits the idea of sequencing and splicing to reach a stable solution in finite steps when s
is fixed.
The parameters c.max
, splicing.type
and max.splicing.iter
allow user control the splicing technique flexibly.
On the basis of our numerical experiment results, we assign properly parameters to the these parameters as the default
such that the precision and runtime are well balanced, we suggest users keep the default values unchanged.
Please see this online page for more details about the splicing algorithm.
To find the optimal support size s
,
we provide various criterion like GIC, AIC, BIC and cross-validation error to determine it.
More specifically, the sequence of models implied by support.size
are fit by the splicing algorithm.
And the solved model with least information criterion or cross-validation error is the optimal model.
The sequential searching for the optimal model is somehow time-wasting.
A faster strategy is golden section (GS), which only need to specify gs.range
.
More details about GS is referred to Zhang et al (2021).
It is worthy to note that the parameters newton
, max.newton.iter
and newton.thresh
allows
user control the parameter estimation in non-gaussian models.
The parameter estimation procedure use Newton method or approximated Newton method (only consider the diagonal elements in the Hessian matrix).
Again, we suggest to use the default values unchanged because the same reason for the parameter c.max
.
abess
support some well-known advanced statistical methods to analyze data, including
sure independent screening: helpful for ultra-high dimensional predictors (i.e.,
p \gg n
). Use the parameterscreening.num
to retain the marginally most important predictors. See Fan et al (2008) for more details.best subset of group selection: helpful when predictors have group structure. Use the parameter
group.index
to specify the group structure of predictors. See Zhang et al (2021) for more details.l_2
regularization best subset selection: helpful when signal-to-ratio is relatively small. Use the parameterlambda
to control the magnitude of the regularization term.nuisance selection: helpful when the prior knowledge of important predictors is available. Use the parameter
always.include
to retain the important predictors.
The arbitrary combination of the four methods are definitely support.
Please see online vignettes for more details about the advanced features support by abess
.
Value
A S3 abess
class object, which is a list
with the following components:
beta |
A |
intercept |
An intercept vector of length |
dev |
the deviance of length |
tune.value |
A value of tuning criterion of length |
nobs |
The number of sample used for training. |
nvars |
The number of variables used for training. |
family |
Type of the model. |
tune.path |
The path type for tuning parameters. |
support.size |
The actual |
edf |
The effective degree of freedom.
It is the same as |
best.size |
The best support size selected by the tuning value. |
tune.type |
The criterion type for tuning parameters. |
tune.path |
The strategy for tuning parameters. |
screening.vars |
The character vector specify the feature
selected by feature screening.
It would be an empty character vector if |
call |
The original call to |
Author(s)
Jin Zhu, Junxian Zhu, Canhong Wen, Heping Zhang, Xueqin Wang
References
A polynomial algorithm for best-subset selection problem. Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, Xueqin Wang. Proceedings of the National Academy of Sciences Dec 2020, 117 (52) 33117-33123; doi:10.1073/pnas.2014241117
A Splicing Approach to Best Subset of Groups Selection. Zhang, Yanhang, Junxian Zhu, Jin Zhu, and Xueqin Wang (2023). INFORMS Journal on Computing, 35:1, 104-119. doi:10.1287/ijoc.2022.1241
abess: A Fast Best-Subset Selection Library in Python and R. Zhu Jin, Xueqin Wang, Liyuan Hu, Junhao Huang, Kangkang Jiang, Yanhang Zhang, Shiyun Lin, and Junxian Zhu. Journal of Machine Learning Research 23, no. 202 (2022): 1-7.
Sure independence screening for ultrahigh dimensional feature space. Fan, J. and Lv, J. (2008), Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70: 849-911. doi:10.1111/j.1467-9868.2008.00674.x
Targeted Inference Involving High-Dimensional Data Using Nuisance Penalized Regression. Qiang Sun & Heping Zhang (2020). Journal of the American Statistical Association, doi:10.1080/01621459.2020.1737079
See Also
print.abess
,
predict.abess
,
coef.abess
,
extract.abess
,
plot.abess
,
deviance.abess
.
Examples
library(abess)
Sys.setenv("OMP_THREAD_LIMIT" = 2)
n <- 100
p <- 20
support.size <- 3
################ linear model ################
dataset <- generate.data(n, p, support.size)
abess_fit <- abess(dataset[["x"]], dataset[["y"]])
## helpful generic functions:
print(abess_fit)
coef(abess_fit, support.size = 3)
predict(abess_fit,
newx = dataset[["x"]][1:10, ],
support.size = c(3, 4)
)
str(extract(abess_fit, 3))
deviance(abess_fit)
plot(abess_fit)
plot(abess_fit, type = "tune")
################ logistic model ################
dataset <- generate.data(n, p, support.size, family = "binomial")
## allow cross-validation to tuning
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
family = "binomial", tune.type = "cv"
)
abess_fit
################ poisson model ################
dataset <- generate.data(n, p, support.size, family = "poisson")
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
family = "poisson", tune.type = "cv"
)
abess_fit
################ Cox model ################
dataset <- generate.data(n, p, support.size, family = "cox")
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
family = "cox", tune.type = "cv"
)
################ Multivariate gaussian model ################
dataset <- generate.data(n, p, support.size, family = "mgaussian")
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
family = "mgaussian", tune.type = "cv"
)
plot(abess_fit, type = "l2norm")
################ Multinomial model (multi-classification) ################
dataset <- generate.data(n, p, support.size, family = "multinomial")
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
family = "multinomial", tune.type = "cv"
)
predict(abess_fit,
newx = dataset[["x"]][1:10, ],
support.size = c(3, 4), type = "response"
)
################ Ordinal regression ################
dataset <- generate.data(n, p, support.size, family = "ordinal", class.num = 4)
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
family = "ordinal", tune.type = "cv"
)
coef <- coef(abess_fit, support.size = abess_fit[["best.size"]])[[1]]
predict(abess_fit,
newx = dataset[["x"]][1:10, ],
support.size = c(3, 4), type = "response"
)
########## Best group subset selection #############
dataset <- generate.data(n, p, support.size)
group_index <- rep(1:10, each = 2)
abess_fit <- abess(dataset[["x"]], dataset[["y"]], group.index = group_index)
str(extract(abess_fit))
################ Golden section searching ################
dataset <- generate.data(n, p, support.size)
abess_fit <- abess(dataset[["x"]], dataset[["y"]], tune.path = "gsection")
abess_fit
################ Feature screening ################
p <- 1000
dataset <- generate.data(n, p, support.size)
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
screening.num = 100
)
str(extract(abess_fit))
################ Sparse predictor ################
require(Matrix)
p <- 1000
dataset <- generate.data(n, p, support.size)
dataset[["x"]][abs(dataset[["x"]]) < 1] <- 0
dataset[["x"]] <- Matrix(dataset[["x"]])
abess_fit <- abess(dataset[["x"]], dataset[["y"]])
str(extract(abess_fit))
################ Formula interface ################
data("trim32")
abess_fit <- abess(y ~ ., data = trim32)
abess_fit