SSGL {SSGL}R Documentation

Spike-and-Slab Group Lasso for Group-Regularized Generalized Linear Models (GLMs)

Description

This is a function to implement group-regularized GLMs with the spike-and-slab group lasso (SSGL) penalty of Bai et al. (2022) and Bai (2023). The identity link function is used for Gaussian regression, the logit link is used for binomial regression, and the log link is used for Poisson, negative binomial, and gamma regression. If the covariates in each x_i are grouped according to known groups g=1, ..., G, then this function can estimate some of the G groups of coefficients as all zero, depending on the amount of regularization.

In addition, this function has the option of returning the generalized information criterion (GIC) of Fan and Tang (2013) for each regularization parameter in the grid lambda0. The GIC can be used for model selection and serves as a useful alternative to cross-validation. The formula for the GIC and a given \lambda_0 is

DIC(\lambda_0) = \frac{1}{n} Deviance_{\lambda_0} + a_n \times \nu),

where Deviance_{\lambda_0} is the deviance computed with the estimate of beta based on spike hyperparameter \lambda_0, \nu_0 is the number of nonzero elements in the estimated beta, and a_n is a sequence that diverges at a suitable rate relative to n. As recommended by Fan and Tang (2013), we set a_n = \{\log(\log(n))\}\log(p).

Usage

SSGL(Y, X, groups, 
     family=c("gaussian","binomial","poisson","negativebinomial","gamma"), 
     X_test, nb_size=1, gamma_shape=1, group_weights, n_lambda0=25, 
     lambda0, lambda1=1, a=1, b=dim(X)[2], 
     max_iter=100, tol = 1e-6, return_GIC=TRUE, print_lambda0=TRUE) 

Arguments

Y

n \times 1 vector of responses for training data.

X

n \times p design matrix for training data, where the jth column of X corresponds to the jth overall covariate.

groups

p-dimensional vector of group labels. The jth entry in groups should contain either the group number or the factor level name that the feature in the jth column of X belongs to. groups must be either a vector of integers or factors.

family

exponential dispersion family of the response variables. Allows for "gaussian", "binomial", "poisson", "negativebinomial", and "gamma". Note that for "negativebinomial", the size parameter must be specified in advance, while for "gamma", the shape parameter must be specified in advance.

X_test

n_{test} \times p design matrix for test data to calculate predictions. X_test must have the same number of columns as X, but not necessarily the same number of rows. If no test data is provided or if in-sample predictions are desired, then the function automatically sets X_test=X in order to calculate in-sample predictions.

nb_size

known size parameter \alpha in NB(\alpha,\mu_i) distribution for the responses if the user specifies family="negativebinomial". Default is nb_size=1. Ignored if family is not "gamma".

gamma_shape

known shape parameter \nu in Gamma(\mu_i,\nu) distribution for the responses if the user specifies family="gamma". Default is gamma_shape=1.

group_weights

group-specific, nonnegative weights for the penalty. Default is to use the square roots of the group sizes.

n_lambda0

number of spike hyperparameters L. Default is n_lambda0=25.

lambda0

grid of L spike hyperparameters \lambda_0. The user may specify either a scalar or a vector. If the user does not provide this, the program chooses the grid automatically.

lambda1

slab hyperparameter \lambda_1 in the SSGL prior. Default is lambda1=1.

a

shape hyperparameter for the Beta(a,b) prior on the mixing proportion in the SSGL prior. Default is a=1.

b

shape hyperparameter for the Beta(a,b) prior on the mixing proportion in the SSGL prior. Default is b=dim(X)[2].

max_iter

maximum number of iterations in the algorithm. Default is max_iter=100.

tol

convergence threshold for algorithm. Default is tol=1e-6.

return_GIC

Boolean variable for whether or not to return the GIC. Default is return_GIC=TRUE.

print_lambda0

Boolean variable for whether or not to print the current value in lambda0. Default is print_lambda0=TRUE.

Value

The function returns a list containing the following components:

lambda0

L \times 1 vector of spike hyperpameters lambda0 used to fit the model. lambda0 is displayed in descending order.

beta

p \times L matrix of estimated regression coefficients. The kth column in beta corresponds to the kth spike hyperparameter in lambda0.

beta0

L \times 1 vector of estimated intercepts. The kth entry in beta0 corresponds to the kth spike hyperparameter in lambda0.

classifications

G \times L matrix of classifications, where G is the number of groups. An entry of "1" indicates that the group was classified as nonzero, and an entry of "0" indicates that the group was classified as zero. The kth column of classifications corresponds to the kth spike hyperparameter in lambda0.

Y_pred

n_{test} \times L matrix of predicted mean response values \mu_{test} = E(Y_{test}) based on the test data in X_test (or training data X if no argument was specified for X_test). The kth column in Y_pred corresponds to the predictions for the kth spike hyperparameter in lambda0.

GIC

L \times 1 vector of GIC values. The kth entry of GIC corresponds to the kth entry in our lambda0 grid. This is not returned if return_GIC=FALSE.

lambda0_min

The value in lambda0 that minimizes GIC. This is not returned if return_GIC=FALSE.

min_index

The index of lambda0_min in lambda0. This is not returned if return_GIC=FALSE.

References

Bai, R. (2023). "Bayesian group regularization in generalized linear models with a continuous spike-and-slab prior." arXiv pre-print arXiv:2007.07021.

Bai, R., Moran, G. E., Antonelli, J. L., Chen, Y., and Boland, M.R. (2022). "Spike-and-slab group lassos for grouped regression and sparse generalized additive models." Journal of the American Statistical Association, 117:184-197.

Fan, Y. and Tang, C. Y. (2013). "Tuning parameter selection in high dimensional penalized likelihood." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75:531-552.

Examples

## Generate data
set.seed(12345)
X = matrix(runif(100*10), nrow=100)
n = dim(X)[1]
groups = c("A","A","A","B","B","B","C","C","D","D")
groups = as.factor(groups)
beta_true = c(-2.5,1.5,1.5,0,0,0,2,-2,0,0)

## Generate responses from Gaussian distribution
Y = crossprod(t(X), beta_true) + rnorm(n)

## Generate test data
n_test = 50
X_test = matrix(runif(n_test*10), nrow=n_test)

## Fit SSGL model with 10 spike hyperparameters
## NOTE: If you do not specify lambda0, the program will automatically choose a suitable grid.
SSGL_mod = SSGL(Y, X, groups, family="gaussian", X_test, lambda0=seq(from=50,to=5,by=-5))

## Regression coefficient estimates
SSGL_mod$beta

## Predicted n_test-dimensional vectors mu=E(Y.test) based on test data, X_test. 
## The kth column of 'Y_pred' corresponds to the kth entry in 'lambda.'
SSGL_mod$Y_pred 

## Classifications of the 8 groups. The kth column of 'classifications'
## corresponds to the kth entry in 'lambda.'
SSGL_mod$classifications

## Plot lambda vs. GIC
plot(SSGL_mod$lambda0, SSGL_mod$GIC, type='l')

## Model selection with the lambda that minimizes GIC
SSGL_mod$lambda0_min
SSGL_mod$min_index 
SSGL_mod$classifications[, SSGL_mod$min_index]
SSGL_mod$beta[, SSGL_mod$min_index]



## Example with binary logistic regression

set.seed(12345)
X = matrix(runif(100*8), nrow=100)
n = dim(X)[1]
groups = c("A","A","A","B","B","B","C","C")
groups = as.factor(groups)
beta_true = c(-2.5,1.5,1.5,0,0,0,2,-2)

## Generate binary responses
eta = crossprod(t(X), beta_true)
Y = rbinom(n, size=1, prob=1/(1+exp(-eta)))

## Generate test data
n_test = 50
X_test = matrix(runif(n_test*8), nrow=n_test)

## Fit SSGL logistic regression model with 10 spike hyperparameters
## NOTE: If you do not specify lambda0, the program will automatically choose a suitable grid.
SSGL_logistic_mod = SSGL(Y, X, groups, family="binomial", X_test, lambda0=seq(from=10,to=1,by=-1.5))

## Regression coefficient estimates
SSGL_logistic_mod$beta

## Predicted n_test-dimensional vectors mu=E(Y_test) based on test data, X_test. 
## The kth column of 'Y_pred' corresponds to the kth entry in 'lambda.'
SSGL_logistic_mod$Y_pred

## Classifications of the 8 groups. The kth column of 'classifications'
## corresponds to the kth entry in 'lambda.'
SSGL_logistic_mod$classifications

## Plot lambda vs. GIC
plot(SSGL_logistic_mod$lambda0, SSGL_logistic_mod$GIC, type='l')

## Model selection with the lambda that minimizes GIC
SSGL_logistic_mod$lambda0_min
SSGL_logistic_mod$min_index 
SSGL_logistic_mod$classifications[, SSGL_logistic_mod$min_index]
SSGL_logistic_mod$beta[, SSGL_logistic_mod$min_index]


[Package SSGL version 1.0 Index]