R: Spike and Slab Regression

spikeslab {spikeslab}

R Documentation

Spike and Slab Regression

Description

Fits a rescaled spike and slab model using a continuous bimodal prior. A generalized elastic net estimator is used for variable selection and estimation. Can be used for prediction and variable selection in low- and high-dimensional linear regression models.

Usage

spikeslab(formula, data = NULL, x = NULL, y = NULL,
    n.iter1 = 500, n.iter2 = 500, mse = TRUE,
    bigp.smalln = FALSE, bigp.smalln.factor = 1, screen = (bigp.smalln),
    r.effects = NULL, max.var = 500, center = TRUE, intercept = TRUE,
    fast = TRUE, beta.blocks = 5, verbose = FALSE, ntree = 300,
    seed = NULL, ...)

Arguments

`formula`	A symbolic description of the model to be fit.
`data`	Data frame containing the data used in the formula.
`x`	x predictor matrix (can be used in place of formula and data frame call).
`y`	y response (can be used in place of formula and data frame call).
`n.iter1`	Number of burn-in Gibbs sampled values (i.e., discarded values).
`n.iter2`	Number of Gibbs sampled values, following burn-in.
`mse`	If TRUE, an external estimate for the overall variance is calculated using ridge regression or random forests (the latter is used when the degrees of freedom are low). Otherwise, the variance is included in the prior and estimated using Gibbs sampling.
`bigp.smalln`	Use if `p` >> `n`.
`bigp.smalln.factor`	Removes all variables except the top `n`times `bigp.smalln.factor` ones (used in filtering when `p` >> `n`).
`screen`	If TRUE, variables are pre-filtered.
`r.effects`	List used for grouping variables (see details below).
`max.var`	Maximum number of variables allowed in the final model.
`center`	If TRUE, variables are centered by their means. Default is TRUE and should only be adjusted in extreme examples.
`intercept`	If TRUE, an intercept is included in the model, otherwise no intercept is included. Default is TRUE.
`fast`	If TRUE, use blocked Gibbs sampling to accelerate the algorithm.
`beta.blocks`	Update beta using this number of blocks (`fast` must be TRUE).
`verbose`	If TRUE, verbose output is sent to the terminal.
`ntree`	Number of trees used by random forests (applies only when `mse` is TRUE).
`seed`	Seed for random number generator. Must be a negative integer.
`...`	Further arguments passed to or from other methods.

Details

—> General:

The spike and slab method is described in detail in Ishwaran and Rao (2003, 2005a, 2005b and 2009). For high-dimensional problems in which p >> n, where p is the number of variables and n is the sample size, use the option bigp.smalln=TRUE. Doing so implements a three-stage procedure:

(1) Filtering step. This removes all variables except the top n times bigp.smalln.factor ones. Uses spike and slab regression with grouped regularization (complexity) parameters.

(2) Model averaging step. Refit the model using only those predictors from step 1. Returns the posterior mean values from fitting a spike and slab model; referred to as the Bayesian model averaged (bma) estimate.

(3) Variable selection step. Select variables using the generalized elastic net (gnet).

The filtering step is omitted when bigp.smalln=FALSE. Filtering can however be requested by setting screen=TRUE although users should be aware that this may degrade performance and should only be used when p is on the same order of n.

Variables can be grouped using r.effects. Grouping has the effect of forcing variables within a given group to share a common complexity (regularization) parameter. To do so, define a list with each entry in the list made up of the variable names to be grouped. There is no limit to the number of groups. Any variable that does not appear in the list will be assigned to a default group (the default group also has its own group-specific regularization parameter). See Examples 1 and 3 below.

—> Miscellanea:

By default, fast=TRUE when bigp.smalln=TRUE. This invokes an ultra-fast filtering step. Setting fast=FALSE invokes a more thorough filtering method that may slightly improve inferential results, but computational times will become very slow. The trade-off is unlikely to be justified.

The formula and data-frame call should be avoided in high-dimensional problems and instead the x-predictor matrix and y response vector should be passed directly (see Example 3). This avoids the huge overhead in parsing formula in R.

By default, predictors are normalized to have mean 0 and variance 1. Pre-processing also involves centering y unless the user specifically requests that the intercept be excluded from the model. Users can also over-ride centering predictors by setting center=FALSE. Use with extreme care.

The verbose option sends output to the terminal showing the number of Gibbs iterations and the current complexity (regularization) parameter(s).

Depends on the randomForest package for estimating the variance when mse=TRUE. Note that mse is over-ridden and set to FALSE when bigp.smalln=TRUE.

Depends on the lars package for the variable slection step.

Value

An object of class spikeslab with the following components:

`summary`	Summary object.
`verbose`	Verbose details (used for printing).
`terms`	Terms.
`sigma.hat`	Estimated variance.
`y`	Original y.
`xnew`	Centered, rescaled x-matrix.
`x`	Original x-matrix.
`y.center`	Centering for original y.
`x.center`	Centering for original x-matrix.
`x.scale`	Scaling for original x-matrix.
`names`	Variable names.
`bma`	bma coefficients in terms of xnew.
`bma.scale`	bma coefficients rescaled in terms of original x.
`gnet`	gnet coefficients in terms of xnew.
`gnet.scale`	gnet coefficients rescaled in terms of original x.
`gnet.path`	gnet path scaled in terms of the original x.
`gnet.obj`	gnet object (a lars-type object).
`gnet.obj.vars`	Variables (in order) used to calculate the gnet object.
`gnet.parms`	Generalized ridge regression parameters used to define the gnet.
`phat`	Estimated model dimension.
`complexity`	Complexity (regularization) parameter estimates.
`ridge`	List containing ridge values used to determine the bma.
`models`	List containing the models sampled.

Author(s)

Hemant Ishwaran (hemant.ishwaran@gmail.com)

J. Sunil Rao (rao.jsunil@gmail.com)

Udaya B. Kogalur (ubk@kogalur.com)

References

Breiman L. (2001). Random forests, Machine Learning, 45:5-32.

Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004). Least angle regression (with discussion). Ann. Statist., 32:407-499.

Ishwaran H. and Rao J.S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Amer. Stat. Assoc., 98:438-455.

Ishwaran H. and Rao J.S. (2005a). Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Statist., 33:730-773.

Ishwaran H. and Rao J.S. (2005b). Spike and slab gene selection for multigroup microarray data. J. Amer. Stat. Assoc., 100:764-780.

Ishwaran H. and Rao J.S. (2010). Generalized ridge regression: geometry and computational solutions when p is larger than n.

Ishwaran H., Kogalur U.B. and Rao J.S. (2010). spikeslab: prediction and variable selection using spike and slab regression. R Journal, 2(2), 68-73.

Ishwaran H. and Rao J.S. (2011). Mixing generalized ridge regressions.

Zou H. and Hastie T. (2005). Regularization and variable selection via the elastic net. J. Royal Statist. Society B, 67(2):301-320.

Examples


#------------------------------------------------------------
# Example 1:  diabetes data
#------------------------------------------------------------

# basic call
data(diabetesI, package = "spikeslab")
obj <- spikeslab(Y ~ . , diabetesI, verbose=TRUE)
print(obj)
plot(obj)

# grouping effect
# separate main effects and interactions into two groups
# use a group-specific regularization parameter for each group
xnames <- names(diabetesI[, -1])
r.eff <- vector("list", 2)
r.eff[[1]] <- xnames[c(1:10)]
r.eff[[2]] <- xnames[-c(1:10)]
obj2 <- spikeslab(Y ~ . , diabetesI, verbose=TRUE, r.effects=r.eff)
obj2
# extract the regularization parameters
print(apply(obj2$complexity, 2, summary))

## Not run: 
#------------------------------------------------------------
# Example 2: high-dimensional noise (diabetes data)
#------------------------------------------------------------

# add 2000 noise variables
data(diabetesI, package = "spikeslab")
diabetes.noise <- cbind(diabetesI,
      noise = matrix(rnorm(nrow(diabetesI) * 2000), nrow(diabetesI)))

# example of a big p, small n call
# don't use formula call; make call with x and y arguments
x <- diabetes.noise[, -1]
y <- diabetes.noise[, 1]
obj <- spikeslab(x=x, y=y, verbose=TRUE, bigp.smalln=TRUE, max.var=100)
obj

# same example ... but now group variables 
r.eff <- vector("list", 2)
r.eff[[1]] <- names(x)[c(1:100)]
r.eff[[2]] <- names(x)[-c(1:100)]
obj2 <- spikeslab(x=x, y=y, verbose=TRUE, bigp.smalln=TRUE,
                 r.effects=r.eff, max.var=100)
obj2

#------------------------------------------------------------
# Example 3: housing data with interactions
#------------------------------------------------------------

# another example of a big p, small n call
data(housingI, package = "spikeslab")
obj <- spikeslab(medv ~ ., housingI, verbose = TRUE,
           bigp.smalln = TRUE, max.var = 200)
print(obj)



## End(Not run)

[Package spikeslab version 1.1.6 Index]