R: Best subset selection with a specified model size

bess.one {BeSS}

R Documentation

Best subset selection with a specified model size

Description

Best subset selection with a specified model size for generalized linear models and Cox's proportional hazard model.

Usage

bess.one(x, y, family = c("gaussian", "binomial", "cox"),
         s = 1,
         max.steps = 15,
         glm.max = 1e6,
         cox.max = 20,
         factor = NULL,
         weights = rep(1,nrow(x)),
         normalize = TRUE)

Arguments

`x`	Input matrix,of dimension n x p; each row is an observation vector.
`y`	Response variable, of length n. For family = "`gaussian`", `y` should be a vector with continuous values. For family = "`binomial`", `y` should be a factor with two levels. For family = "`cox`", `y` should be a two-column matrix with columns named 'time' and 'status'.
`s`	Size of the selected model.It controls number of nonzero coefiicients to be allowed in the model.
`family`	One of the ditribution function for GLM or Cox models. Either "`gaussian`", "`binomial`", or "`cox`", depending on the response.
`max.steps`	The maximum number of iterations in the primal dual active set algorithm. In most cases, only a few steps can gurantee the convergence. Default is 15.
`glm.max`	The maximum number of iterations for solving the maximum likelihood problem on the active set. It occurs at each step in the primal dual active set algorithm. Only used in the logistic regression for family = "`binomial`". Default is `1e+6`.
`cox.max`	The maximum number of iterations for solving the maximum partial likelihood problem on the active set. It occurs at each step in the primal dual active set algorithm. Only used in Cox model for family = "`cox`". Default is 20.
`weights`	Observation weights. Default is 1 for each observation
`factor`	Which variable to be factored. Should be NULL or a numeric vector.
`normalize`	Whether to normalize `x` or not. Default is TRUE.

Details

Given a model size s, we consider the following best subset selection problem:

\min_\beta -2 logL(\beta) ;{ s.t.} \|\beta\|_0 = s.

In the GLM case, logL(\beta) is the log-likelihood function; In the Cox model, logL(\beta) is the log parital likelihood function.

The best subset selection problem is solved by the primal dual active set algorithm, see Wen et al. (2017) for details. This algorithm utilizes an active set updating strategy via primal and dual vairables and fits the sub-model by exploiting the fact that their support set are non-overlap and complementary.

Value

A list with class attribute 'bess.one' and named components:

`type`	Types of the model: "`bess_gaussian`" for linear model, "`bess_binomial`" for logistic model and "`bess_cox`" for Cox model
`beta`	The best fitting coefficients with the smallest loss function given the model size `s`.
`lambda`	The estimated lambda value in the Lagrangian form of the best subset selection problem with model size `s`.
`bestmodel`	The best fitted model, the class of which is "lm", "glm" or "coxph"
`deviance`	The value of `-2*logL(\beta)`.
`nulldeviance`	The value of `-2*logL(\beta)` for null model.
`factor`	Which variable to be factored. Should be NULL or a numeric vector.

Author(s)

Canhong Wen, Aijun Zhang, Shijie Quan, and Xueqin Wang.

References

Wen, C., Zhang, A., Quan, S. and Wang, X. (2020). BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models, Journal of Statistical Software, Vol. 94(4). doi:10.18637/jss.v094.i04.

Examples


#--------------linear model--------------#
# Generate simulated data

n <- 500
p <- 20
K <-10
sigma <- 1
rho <- 0.2
data <- gen.data(n, p, family = "gaussian", K, rho, sigma)


# Best subset selection
fit1 <- bess.one(data$x, data$y, s = 10, family = "gaussian", normalize = TRUE)
#coef(fit1,sparse=TRUE)
bestmodel <- fit1$bestmodel
#summary(bestmodel)

## Not run: 
#--------------logistic model--------------#

# Generate simulated data
data <- gen.data(n, p, family = "binomial", K, rho, sigma)

# Best subset selection
fit2 <- bess.one(data$x, data$y, family = "binomial", s = 10, normalize = TRUE)
bestmodel <- fit2$bestmodel
#summary(bestmodel)

#--------------cox model--------------#

# Generate simulated data
data <- gen.data(n, p, K, rho, sigma, c=10, family="cox", scal=10)

# Best subset selection
fit3 <- bess.one(data$x, data$y, s = 10, family = "cox", normalize = TRUE)
bestmodel <- fit3$bestmodel
#summary(bestmodel)

#----------------------High dimensional linear models--------------------#

p <- 1000
data <- gen.data(n, p, family = "gaussian", K, rho, sigma)

# Best subset selection
fit <- bess.one(data$x, data$y, s=10, family = "gaussian", normalize = TRUE)

#---------------prostate---------------#
data("prostate")
x = prostate[,-9]
y = prostate[,9]

fit.ungroup = bess.one(x, y, s=5)
fit.group = bess.one(x, y, s=5, factor = c("gleason"))

#---------------SAheart---------------#
data(SAheart)
y = SAheart[,5]
x = SAheart[,-5]
x$ldl[x$ldl<5] = 1
x$ldl[x$ldl>=5&x$ldl<10] = 2
x$ldl[x$ldl>=10] = 3

fit.ungroup = bess.one(x, y, s=5, family = "binomial")
fit.group = bess.one(x, y, s=5, factor = c("ldl"), family = "binomial")
## End(Not run)

[Package BeSS version 2.0.4 Index]