R: Best Orthogonalized Subset Selection (BOSS).

boss {BOSSreg}

R Documentation

Best Orthogonalized Subset Selection (BOSS).

Description

Compute the solution path of BOSS and forward stepwise selection (FS).
Also compute various information criteria based on heuristic degrees of freedom (hdf), which can serve as selection rules for choosing among the subsets given by BOSS.

Usage

boss(
  x,
  y,
  maxstep = min(nrow(x) - intercept - 1, ncol(x)),
  intercept = TRUE,
  hdf.ic.boss = TRUE,
  mu = NULL,
  sigma = NULL,
  ...
)

Arguments

`x`	A matrix of predictors, with `nrow(x)=length(y)=n` observations and `ncol(x)=p` predictors. Do NOT include an intercept column.
`y`	A vector of responses, with `length(y)=n`.
`maxstep`	Maximum number of steps performed. Default is `min(n-1,p)` if `intercept=FALSE`, and it is `min(n-2, p)` otherwise.
`intercept`	Logical, whether to include an intercept term. Default is TRUE.
`hdf.ic.boss`	Logical, whether to calculate the heuristic degrees of freedom (hdf) and information criteria (IC) for BOSS. IC includes AIC, BIC, AICc, BICc, GCV, Cp. Default is TRUE.
`mu`	True mean vector, used in the calculation of hdf. Default is NULL, in which case it is estimated by least-squares (LS) regression of y on x when n>p, and by the 10-fold cross-validated lasso when n<=p.
`sigma`	True standard deviation of the error, used in the calculation of hdf. Default is NULL, in which case it is estimated by least-squares (LS) regression of y on x when n>p, and by the 10-fold cross-validated lasso when n<=p.
`...`	Extra parameters to allow flexibility. Currently none are allowed or required; the argument exists only for the convenience of calls from parent functions such as cv.boss.
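
The documented defaults for `maxstep`, `mu` and `sigma` can be illustrated in base R for the n>p case. This is only a sketch: the helper `default_inputs` is hypothetical and not part of BOSSreg, and the package may compute these quantities differently internally (the lasso-based estimate for n<=p is not shown).

```r
# Hypothetical helper (not part of BOSSreg): mirrors the documented
# defaults for maxstep, mu and sigma when n > p.
default_inputs <- function(x, y, intercept = TRUE) {
  n <- nrow(x)
  p <- ncol(x)
  stopifnot(length(y) == n, n > p)
  # Documented default: min(n-1, p) without an intercept, min(n-2, p) with one
  maxstep <- min(n - intercept - 1, p)
  # Full least-squares fit of y on x, used to estimate mu and sigma
  fit <- if (intercept) lm(y ~ x) else lm(y ~ x - 1)
  mu <- as.numeric(fitted(fit))
  # Residual standard error: sqrt(RSS / residual degrees of freedom)
  sigma <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
  list(maxstep = maxstep, mu = mu, sigma = sigma)
}

set.seed(1)
x <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)
y <- x[, 1] + x[, 2] + rnorm(20, sd = 0.1)
d <- default_inputs(x, y)
```

Values obtained this way could be passed explicitly as `mu` and `sigma` if the defaults are not appropriate for the data at hand.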

Details

This function computes the full solution path given by BOSS and FS on a given dataset (x,y) with n observations and p predictors. It also calculates the heuristic degrees of freedom for BOSS, and various information criteria, which can further be used to select the subset from the candidates. Please refer to the Vignette for implementation details and Tian et al. (2021) for methodology details (links are given below).
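
The idea behind the BOSS path can be sketched in base R: order the predictors by forward stepwise selection, orthogonalize the ordered predictors with a QR decomposition, then form subsets by keeping the orthogonal directions with the largest coefficients. The function `boss_sketch` below is hypothetical and is not the package's implementation; it assumes no intercept, centered data, n>p and a full-rank x.

```r
# Hypothetical sketch of the BOSS path (not the BOSSreg implementation).
# Assumes: no intercept, centered data, n > p, full-rank x.
boss_sketch <- function(x, y) {
  y <- as.numeric(y)
  p <- ncol(x)
  # Step 1: forward stepwise ordering. At each step, add the predictor
  # whose residual (given the predictors already chosen) has the largest
  # absolute correlation with the residual of y.
  chosen <- integer(0)
  for (k in seq_len(p)) {
    rest <- setdiff(seq_len(p), chosen)
    if (length(chosen) > 0) {
      qs <- qr(x[, chosen, drop = FALSE])
      ry <- qr.resid(qs, y)
      rx <- as.matrix(qr.resid(qs, x[, rest, drop = FALSE]))
    } else {
      ry <- y
      rx <- x[, rest, drop = FALSE]
    }
    # Proportional to the absolute partial correlation with y
    score <- abs(drop(crossprod(rx, ry))) / sqrt(colSums(rx^2))
    chosen <- c(chosen, rest[which.max(score)])
  }
  # Step 2: orthogonalize the ordered predictors, x[, chosen] = Q R
  qrx <- qr(x[, chosen, drop = FALSE])
  stopifnot(qrx$rank == p)
  z <- drop(crossprod(qr.Q(qrx), y))
  R <- qr.R(qrx)
  # Step 3: rank the orthogonal directions by |z|
  steps_q <- order(abs(z), decreasing = TRUE)
  # Step 4: for each size k, keep the k largest |z| and map back to x
  beta <- matrix(0, nrow = p, ncol = p + 1)
  for (k in seq_len(p)) {
    zk <- numeric(p)
    keep <- steps_q[seq_len(k)]
    zk[keep] <- z[keep]
    beta[chosen, k + 1] <- backsolve(R, zk)
  }
  list(beta_boss = beta, steps_x = chosen, steps_q = steps_q)
}

set.seed(11)
x <- scale(matrix(rnorm(20 * 5), nrow = 20, ncol = 5), center = TRUE, scale = FALSE)
y <- drop(x %*% c(1, 1, 0, 0, 0)) + rnorm(20, sd = 0.01)
y <- y - mean(y)
res <- boss_sketch(x, y)
```

In this sketch the first column of `res$beta_boss` is the null model and the last column coincides with the full least-squares fit; the intermediate columns may contain more non-zero coefficients than their step index suggests, because each orthogonal direction is a combination of several original predictors.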

Value

beta_fs: A matrix of regression coefficients for all the subsets given by FS, from a null model until stop, with nrow=p and ncol=min(n,p)+1, where min(n,p) is the maximum number of steps performed.
beta_boss: A matrix of regression coefficients for all the subsets given by BOSS, with nrow=p and ncol=min(n,p)+1. Note that, unlike beta_fs and due to the nature of BOSS, the columns of beta_boss need not have distinct numbers of non-zero components, i.e. there may be multiple columns corresponding to the same subset size.
steps_x: A vector of numbers representing which predictor joins at each step, with length(steps_x)=min(n,p). The ordering is determined by the partial correlation between a predictor x_j and the response y.
steps_q: A vector of numbers representing which predictor joins at each step in the orthogonal basis, with length(steps_q)=min(n,p). BOSS takes the ordered predictors (ordering given in steps_x) and performs best subset regression upon their orthogonal basis, which amounts to ordering the orthogonalized predictors by their marginal correlations with the response y. For example, steps_q=c(2,1) indicates that the orthogonal basis of x_2 joins first.
hdf_boss: A vector of heuristic degrees of freedom (hdf) for BOSS, with length(hdf_boss)=p+1. Note that hdf_boss=NULL if n<=p or hdf.ic.boss=FALSE.
IC_boss: A list of information criteria (IC) for BOSS, where each element in the list is a vector representing values of a given IC for each candidate subset of BOSS (or each column in beta_boss). The output IC includes AIC, BIC, AICc, BICc, GCV and Mallows' Cp. Note that each IC is calculated by plugging in hdf_boss.
sigma: Estimated error standard deviation. It is only returned when hdf is calculated, i.e. hdf.ic.boss=TRUE.
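
How degrees of freedom enter an information criterion can be sketched with textbook forms of Cp, AIC, BIC, AICc and GCV. The helper `ic_sketch` is hypothetical, the input numbers are made up for illustration, and the formulas may differ from the package's by constants or scaling that do not change which subset minimizes the criterion.

```r
# Hypothetical helper (not part of BOSSreg): textbook forms of several
# criteria, given the residual sum of squares (rss), the degrees of
# freedom (df), the sample size (n) and the error standard deviation.
ic_sketch <- function(rss, df, n, sigma) {
  list(
    cp   = rss + 2 * sigma^2 * df,
    aic  = n * log(rss / n) + 2 * df,
    bic  = n * log(rss / n) + log(n) * df,
    aicc = n * log(rss / n) + n * (n + df) / (n - df - 2),
    gcv  = rss / (n - df)^2
  )
}

# Made-up values for four candidate subsets (illustration only)
rss <- c(10, 4, 3.9, 3.85)
df  <- c(0, 1.2, 2.5, 3.1)
ic  <- ic_sketch(rss, df, n = 20, sigma = 0.5)
best_by_cp <- which.min(ic$cp)
```

The index returned by `which.min` plays the same role as the column index used to pick a coefficient vector out of beta_boss.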

Author(s)

Sen Tian

References

Tian, S., Hurvich, C. and Simonoff, J. (2021), On the Use of Information Criteria for Subset Selection in Least Squares Regression. https://arxiv.org/abs/1911.10191
Reid, S., Tibshirani, R. and Friedman, J. (2016), A Study of Error Variance Estimation in Lasso Regression. Statistica Sinica, pp. 35-67.
BOSSreg Vignette https://github.com/sentian/BOSSreg/blob/master/r-package/vignettes/BOSSreg.pdf

Examples

## Generate a trivial dataset: columns of x have mean 0 and norm 1, y has mean 0
set.seed(11)
n = 20
p = 5
x = matrix(rnorm(n*p), nrow=n, ncol=p)
x = scale(x, center = colMeans(x))
x = scale(x, scale = sqrt(colSums(x^2)))
beta = c(1, 1, 0, 0, 0)
y = x%*%beta + scale(rnorm(n, sd=0.01), center = TRUE, scale = FALSE)

## Fit the model
boss_result = boss(x, y)

## Get the coefficient vector selected by AICc-hdf (S3 method for boss)
beta_boss_aicc = coef(boss_result)
# the above is equivalent to the following
beta_boss_aicc = boss_result$beta_boss[, which.min(boss_result$IC_boss$aicc), drop=FALSE]
## Get the fitted values of BOSS-AICc-hdf (S3 method for boss)
mu_boss_aicc = predict(boss_result, newx=x)
# the above is equivalent to the following
mu_boss_aicc = cbind(1,x) %*% beta_boss_aicc

## Repeat the above process, but using Cp-hdf instead of AICc-hdf
## coefficient vector
beta_boss_cp = coef(boss_result, method.boss='cp')
# the above is equivalent to the following
beta_boss_cp = boss_result$beta_boss[, which.min(boss_result$IC_boss$cp), drop=FALSE]
## fitted values of BOSS-Cp-hdf
mu_boss_cp = predict(boss_result, newx=x, method.boss='cp')
# the above is equivalent to the following
mu_boss_cp = cbind(1,x) %*% beta_boss_cp

[Package BOSSreg version 0.2.0 Index]