R: SODA algorithm for variable and interaction selection

soda {sodavis}

R Documentation

SODA algorithm for variable and interaction selection

Description

SODA is a forward-backward variable and interaction selection algorithm for logistic regression models with second-order terms. In the forward stage, a stepwise procedure screens for important predictors with main and interaction effects; in the backward stage, SODA removes insignificant terms so as to optimize the extended BIC (EBIC) criterion. SODA is applicable to variable selection in logistic regression, linear/quadratic discriminant analysis, and other discriminant analyses whose generative model belongs to the exponential family.
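The forward-backward idea can be illustrated with a deliberately simplified sketch in base R. It is not the sodavis implementation: it handles main effects only, ignores interaction screening and the minF rule, and the helper names ebic_logistic and forward_backward are hypothetical. For 0/1 responses the binomial deviance returned by glm.fit equals -2 * log-likelihood, which is what the EBIC score below relies on.

```r
# Toy forward-backward selection of main effects under logistic regression,
# scored by EBIC. Hypothetical helpers; NOT the algorithm used by soda().

# EBIC of the logistic model that uses the columns of xx listed in `vars`.
ebic_logistic <- function(xx, yy, vars, gam = 0) {
  n <- nrow(xx)
  p <- ncol(xx)
  design <- cbind(rep(1, n), xx[, vars, drop = FALSE])  # intercept + selected columns
  fit <- glm.fit(design, yy, family = binomial())
  k <- length(vars)
  # For binary 0/1 data, deviance = -2 * log-likelihood.
  fit$deviance + k * log(n) + 2 * k * gam * log(p)
}

forward_backward <- function(xx, yy, gam = 0) {
  p <- ncol(xx)
  selected <- integer(0)
  best <- ebic_logistic(xx, yy, selected, gam)

  # Forward stage: keep adding the predictor that lowers EBIC the most.
  repeat {
    candidates <- setdiff(seq_len(p), selected)
    if (length(candidates) == 0) break
    scores <- sapply(candidates, function(j) {
      ebic_logistic(xx, yy, c(selected, j), gam)
    })
    if (min(scores) >= best) break
    best <- min(scores)
    selected <- c(selected, candidates[which.min(scores)])
  }

  # Backward stage: keep dropping the term whose removal lowers EBIC the most.
  repeat {
    if (length(selected) == 0) break
    scores <- sapply(seq_along(selected), function(i) {
      ebic_logistic(xx, yy, selected[-i], gam)
    })
    if (min(scores) >= best) break
    best <- min(scores)
    selected <- selected[-which.min(scores)]
  }

  list(vars = sort(selected), ebic = best)
}

# Small simulated check: only columns 1 and 3 carry signal.
set.seed(1)
n <- 200
p <- 10
xx <- matrix(rnorm(n * p), n, p)
zz <- 2 * xx[, 1] - 2 * xx[, 3]
yy <- as.numeric(runif(n) < 1 / (1 + exp(-zz)))
res <- forward_backward(xx, yy, gam = 0.5)
print(res)
```

A larger gam makes every added term more expensive, so the forward stage stops earlier and the backward stage prunes more aggressively.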

Usage

soda(xx, yy, norm = F, debug = F, gam = 0, minF = 3)

Arguments

`xx`	The design matrix, of dimensions n * p, without an intercept. Each row is an observation vector.
`yy`	The response vector of dimension n * 1.
`norm`	Logical flag: if TRUE, each variable in xx is quantile-normalized to a standard normal distribution before the SODA algorithm is run. Default is norm=FALSE. Quantile normalization is suggested if the data contain obvious outliers.
`debug`	Logical flag for printing debug information.
`gam`	Tuning parameter gamma in the extended BIC criterion. For a selected set S, EBIC = -2 * log-likelihood + |S| * log(n) + 2 * |S| * gamma * log(p). Default is gam=0.
`minF`	Minimum number of steps in forward interaction screening. Default is minF=3.
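Two of the arguments above can be made concrete with a short base-R sketch. The function quantile_normalize shows one common rank-based way to map a variable to standard-normal quantiles; it is an assumption that this resembles what norm=TRUE does, since the exact transform inside soda() is not documented here. The function ebic evaluates the formula given for gam; both function names are hypothetical, and the log-likelihood value is made up for illustration.

```r
# Rank-based quantile normalization of one variable (assumed recipe, not
# necessarily the transform used inside soda()).
quantile_normalize <- function(x) {
  qnorm(rank(x, ties.method = "average") / (length(x) + 1))
}

# EBIC for a term set of size k, following the formula in the `gam` entry.
ebic <- function(loglik, k, n, p, gam = 0) {
  -2 * loglik + k * log(n) + 2 * k * gam * log(p)
}

# An obvious outlier (250) is pulled back to an ordinary normal quantile,
# while the ordering of the observations is preserved.
x <- c(0.1, -0.3, 0.7, 250)
z <- quantile_normalize(x)
print(round(z, 3))

# With gam = 0 the score is the ordinary BIC; gam > 0 adds a penalty
# that grows with the number of candidate predictors p.
print(ebic(loglik = -100, k = 3, n = 250, p = 1000, gam = 0))
print(ebic(loglik = -100, k = 3, n = 250, p = 1000, gam = 0.5))
```

To normalize a whole design matrix this way, the helper could be applied column by column, for example apply(xx, 2, quantile_normalize).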

Value

`EBIC`	Trace of extended Bayesian information criterion (EBIC) score.
`Type`	Trace of step type ("Forward (Main)", "Forward (Int)", "Backward").
`Var`	Trace of selected variables.
`Term`	Trace of selected main and interaction terms.
`final_EBIC`	EBIC score of the final selected term set.
`final_Var`	Final selected variables.
`final_Term`	Final selected main and interaction terms.
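As a rough guide to reading the trace components, the sketch below builds a hypothetical stand-in for a soda() result and tabulates it step by step. All values are invented, and the element shapes (numeric and character vectors for EBIC and Type, lists with one entry per step for Var and Term) as well as the term labels are assumptions rather than documented behaviour.

```r
# Hypothetical stand-in for a soda() result; every value below is made up
# and the shapes of the components are assumed, not taken from sodavis.
res <- list(
  EBIC = c(340.2, 310.5, 305.0, 298.1),
  Type = c("Forward (Main)", "Forward (Int)", "Forward (Int)", "Backward"),
  Var  = list(1, c(1, 10), c(1, 7, 10, 20), c(1, 10, 20)),
  Term = list(
    "X1",
    c("X1", "X10*X10"),
    c("X1", "X7", "X10*X10", "X10*X20"),
    c("X1", "X10*X10", "X10*X20")
  )
)

# One row per step: step type, score, and how many variables/terms are active.
trace <- data.frame(
  step   = seq_along(res$EBIC),
  type   = res$Type,
  ebic   = res$EBIC,
  n_var  = lengths(res$Var),
  n_term = lengths(res$Term)
)
print(trace)

# Step with the lowest EBIC in the trace and the terms active at that step.
best <- which.min(res$EBIC)
print(res$Term[[best]])
```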

Author(s)

Yang Li, Jun S. Liu

References

Li Y, Liu JS. (2017). Robust Variable and Interaction Selection for Logistic Regression and Multiple Index Models. Technical Report.

Examples

# # (uncomment the code to run)
# library(MASS)      # provides mvrnorm()
# library(sodavis)
#
# # simulation study with 1 main effect and 2 interactions
# N = 250
# p = 1000
# r = 0.5
# s = 1
# H = abs(outer(1:p, 1:p, "-"))
# S = s * r^H                                   # covariance S[i, j] = s * r^|i - j|
# S[cbind(1:p, 1:p)] = S[cbind(1:p, 1:p)] * s   # rescale the diagonal (no effect when s = 1)
#
# xx = as.matrix(data.frame(mvrnorm(N, rep(0, p), S)))
# zz = 1 + xx[,1] - xx[,10]^2 + xx[,10]*xx[,20]         # linear predictor
# yy = as.numeric(runif(N) < exp(zz) / (1 + exp(zz)))   # binary response
#
# res_SODA = soda(xx, yy, gam = 0.5)
# cv_SODA  = soda_trace_CV(xx, yy, res_SODA)
# cv_SODA
#
# # Michigan lung cancer dataset
# data(mich_lung)
# res_SODA = soda(mich_lung_xx, mich_lung_yy, gam = 0.5)
# cv_SODA  = soda_trace_CV(mich_lung_xx, mich_lung_yy, res_SODA)
# cv_SODA

[Package sodavis version 1.2 Index]