R: Spike-and-Slab Group Lasso for Group-Regularized Generalized...

SSGL {SSGL}

R Documentation

Spike-and-Slab Group Lasso for Group-Regularized Generalized Linear Models (GLMs)

Description

This is a function to implement group-regularized GLMs with the spike-and-slab group lasso (SSGL) penalty of Bai et al. (2022) and Bai (2023). The identity link function is used for Gaussian regression, the logit link is used for binomial regression, and the log link is used for Poisson, negative binomial, and gamma regression. If the covariates in each x_i are grouped according to known groups g=1, ..., G, then this function can estimate some of the G groups of coefficients as all zero, depending on the amount of regularization.

In addition, this function has the option of returning the generalized information criterion (GIC) of Fan and Tang (2013) for each regularization parameter in the grid lambda0. The GIC can be used for model selection and serves as a useful alternative to cross-validation. The formula for the GIC and a given \lambda_0 is

DIC(\lambda_0) = \frac{1}{n} Deviance_{\lambda_0} + a_n \times \nu),

where Deviance_{\lambda_0} is the deviance computed with the estimate of beta based on spike hyperparameter \lambda_0, \nu_0 is the number of nonzero elements in the estimated beta, and a_n is a sequence that diverges at a suitable rate relative to n. As recommended by Fan and Tang (2013), we set a_n = \{\log(\log(n))\}\log(p).

Usage

SSGL(Y, X, groups, 
     family=c("gaussian","binomial","poisson","negativebinomial","gamma"), 
     X_test, nb_size=1, gamma_shape=1, group_weights, n_lambda0=25, 
     lambda0, lambda1=1, a=1, b=dim(X)[2], 
     max_iter=100, tol = 1e-6, return_GIC=TRUE, print_lambda0=TRUE)

Arguments

`Y`	`n \times 1` vector of responses for training data.
`X`	`n \times p` design matrix for training data, where the `j`th column of `X` corresponds to the `j`th overall covariate.
`groups`	`p`-dimensional vector of group labels. The `j`th entry in `groups` should contain either the group number or the factor level name that the feature in the `j`th column of `X` belongs to. `groups` must be either a vector of integers or factors.
`family`	exponential dispersion family of the response variables. Allows for `"gaussian"`, `"binomial"`, `"poisson"`, `"negativebinomial"`, and `"gamma"`. Note that for `"negativebinomial"`, the size parameter must be specified in advance, while for `"gamma"`, the shape parameter must be specified in advance.
`X_test`	`n_{test} \times p` design matrix for test data to calculate predictions. `X_test` must have the same number of columns as X, but not necessarily the same number of rows. If no test data is provided or if in-sample predictions are desired, then the function automatically sets `X_test=X` in order to calculate in-sample predictions.
`nb_size`	known size parameter `\alpha` in `NB(\alpha,\mu_i)` distribution for the responses if the user specifies `family="negativebinomial"`. Default is `nb_size=1`. Ignored if `family` is not `"gamma"`.
`gamma_shape`	known shape parameter `\nu` in `Gamma(\mu_i,\nu)` distribution for the responses if the user specifies `family="gamma"`. Default is `gamma_shape=1`.
`group_weights`	group-specific, nonnegative weights for the penalty. Default is to use the square roots of the group sizes.
`n_lambda0`	number of spike hyperparameters `L`. Default is `n_lambda0=25`.
`lambda0`	grid of `L` spike hyperparameters `\lambda_0`. The user may specify either a scalar or a vector. If the user does not provide this, the program chooses the grid automatically.
`lambda1`	slab hyperparameter `\lambda_1` in the SSGL prior. Default is `lambda1=1`.
`a`	shape hyperparameter for the `Beta(a,b)` prior on the mixing proportion in the SSGL prior. Default is `a=1`.
`b`	shape hyperparameter for the `Beta(a,b)` prior on the mixing proportion in the SSGL prior. Default is `b=dim(X)[2]`.
`max_iter`	maximum number of iterations in the algorithm. Default is `max_iter=100`.
`tol`	convergence threshold for algorithm. Default is `tol=1e-6`.
`return_GIC`	Boolean variable for whether or not to return the GIC. Default is `return_GIC=TRUE`.
`print_lambda0`	Boolean variable for whether or not to print the current value in `lambda0`. Default is `print_lambda0=TRUE`.

Value

The function returns a list containing the following components:

`lambda0`	`L \times 1` vector of spike hyperpameters `lambda0` used to fit the model. `lambda0` is displayed in descending order.
`beta`	`p \times L` matrix of estimated regression coefficients. The `k`th column in `beta` corresponds to the `k`th spike hyperparameter in `lambda0`.
`beta0`	`L \times 1` vector of estimated intercepts. The `k`th entry in `beta0` corresponds to the `k`th spike hyperparameter in `lambda0`.
`classifications`	`G \times L` matrix of classifications, where `G` is the number of groups. An entry of "1" indicates that the group was classified as nonzero, and an entry of "0" indicates that the group was classified as zero. The `k`th column of `classifications` corresponds to the `k`th spike hyperparameter in `lambda0`.
`Y_pred`	`n_{test} \times L` matrix of predicted mean response values `\mu_{test} = E(Y_{test})` based on the test data in `X_test` (or training data `X` if no argument was specified for `X_test`). The `k`th column in `Y_pred` corresponds to the predictions for the `k`th spike hyperparameter in `lambda0`.
`GIC`	`L \times 1` vector of GIC values. The `k`th entry of `GIC` corresponds to the `k`th entry in our `lambda0` grid. This is not returned if `return_GIC=FALSE`.
`lambda0_min`	The value in `lambda0` that minimizes `GIC`. This is not returned if `return_GIC=FALSE`.
`min_index`	The index of `lambda0_min` in `lambda0`. This is not returned if `return_GIC=FALSE`.

References

Bai, R. (2023). "Bayesian group regularization in generalized linear models with a continuous spike-and-slab prior." arXiv pre-print arXiv:2007.07021.

Bai, R., Moran, G. E., Antonelli, J. L., Chen, Y., and Boland, M.R. (2022). "Spike-and-slab group lassos for grouped regression and sparse generalized additive models." Journal of the American Statistical Association, 117:184-197.

Fan, Y. and Tang, C. Y. (2013). "Tuning parameter selection in high dimensional penalized likelihood." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75:531-552.

Examples

## Generate data
set.seed(12345)
X = matrix(runif(100*10), nrow=100)
n = dim(X)[1]
groups = c("A","A","A","B","B","B","C","C","D","D")
groups = as.factor(groups)
beta_true = c(-2.5,1.5,1.5,0,0,0,2,-2,0,0)

## Generate responses from Gaussian distribution
Y = crossprod(t(X), beta_true) + rnorm(n)

## Generate test data
n_test = 50
X_test = matrix(runif(n_test*10), nrow=n_test)

## Fit SSGL model with 10 spike hyperparameters
## NOTE: If you do not specify lambda0, the program will automatically choose a suitable grid.
SSGL_mod = SSGL(Y, X, groups, family="gaussian", X_test, lambda0=seq(from=50,to=5,by=-5))

## Regression coefficient estimates
SSGL_mod$beta

## Predicted n_test-dimensional vectors mu=E(Y.test) based on test data, X_test. 
## The kth column of 'Y_pred' corresponds to the kth entry in 'lambda.'
SSGL_mod$Y_pred 

## Classifications of the 8 groups. The kth column of 'classifications'
## corresponds to the kth entry in 'lambda.'
SSGL_mod$classifications

## Plot lambda vs. GIC
plot(SSGL_mod$lambda0, SSGL_mod$GIC, type='l')

## Model selection with the lambda that minimizes GIC
SSGL_mod$lambda0_min
SSGL_mod$min_index 
SSGL_mod$classifications[, SSGL_mod$min_index]
SSGL_mod$beta[, SSGL_mod$min_index]



## Example with binary logistic regression

set.seed(12345)
X = matrix(runif(100*8), nrow=100)
n = dim(X)[1]
groups = c("A","A","A","B","B","B","C","C")
groups = as.factor(groups)
beta_true = c(-2.5,1.5,1.5,0,0,0,2,-2)

## Generate binary responses
eta = crossprod(t(X), beta_true)
Y = rbinom(n, size=1, prob=1/(1+exp(-eta)))

## Generate test data
n_test = 50
X_test = matrix(runif(n_test*8), nrow=n_test)

## Fit SSGL logistic regression model with 10 spike hyperparameters
## NOTE: If you do not specify lambda0, the program will automatically choose a suitable grid.
SSGL_logistic_mod = SSGL(Y, X, groups, family="binomial", X_test, lambda0=seq(from=10,to=1,by=-1.5))

## Regression coefficient estimates
SSGL_logistic_mod$beta

## Predicted n_test-dimensional vectors mu=E(Y_test) based on test data, X_test. 
## The kth column of 'Y_pred' corresponds to the kth entry in 'lambda.'
SSGL_logistic_mod$Y_pred

## Classifications of the 8 groups. The kth column of 'classifications'
## corresponds to the kth entry in 'lambda.'
SSGL_logistic_mod$classifications

## Plot lambda vs. GIC
plot(SSGL_logistic_mod$lambda0, SSGL_logistic_mod$GIC, type='l')

## Model selection with the lambda that minimizes GIC
SSGL_logistic_mod$lambda0_min
SSGL_logistic_mod$min_index 
SSGL_logistic_mod$classifications[, SSGL_logistic_mod$min_index]
SSGL_logistic_mod$beta[, SSGL_logistic_mod$min_index]

[Package SSGL version 1.0 Index]