R: Regularised fit of sparse linear regression

lm.pels.fit {fsemipar}

R Documentation

Regularised fit of sparse linear regression

Description

This function fits a sparse linear model between a scalar response and a vector of scalar covariates. It employs a penalised least-squares regularisation procedure, with either (group)SCAD or (group)LASSO penalties. The method utilises an objective criterion (criterion) to select the optimal regularisation parameter (lambda.opt).

Usage

lm.pels.fit(z, y, lambda.min = NULL, lambda.min.h = NULL, lambda.min.l = NULL,
factor.pn = 1, nlambda = 100, lambda.seq = NULL, vn = ncol(z), nfolds = 10, 
seed = 123, criterion = "GCV", penalty = "grSCAD", max.iter = 1000)

Arguments

`z`	Matrix containing the observations of the covariates collected by row.
`y`	Vector containing the scalar response.
`lambda.min`	The smallest value for lambda (i.e., the lower endpoint of the sequence in which `lambda.opt` is selected), as a fraction of `lambda.max`. The default is `lambda.min.l` if the sample size is larger than `factor.pn` times the number of linear covariates and `lambda.min.h` otherwise.
`lambda.min.h`	The lower endpoint of the sequence in which `lambda.opt` is selected if the sample size is smaller than `factor.pn` times the number of linear covariates. The default is 0.05.
`lambda.min.l`	The lower endpoint of the sequence in which `lambda.opt` is selected if the sample size is larger than `factor.pn` times the number of linear covariates. The default is 0.0001.
`factor.pn`	Positive integer used to set `lambda.min`. The default value is 1.
`nlambda`	Positive integer indicating the number of values in the sequence from which `lambda.opt` is selected. The default is 100.
`lambda.seq`	Sequence of values in which `lambda.opt` is selected. If `lambda.seq=NULL`, then the programme builds the sequence automatically using `lambda.min` and `nlambda`.
`vn`	Positive integer or vector of positive integers indicating the number of groups of consecutive variables to be penalised together. The default value is `vn=ncol(z)`, resulting in the individual penalisation of each scalar covariate.
`nfolds`	Number of cross-validation folds (used when `criterion="k-fold-CV"`). Default is 10.
`seed`	Seed for the random number generator, used to ensure reproducible results (applicable when `criterion="k-fold-CV"` is used). The default seed value is 123.
`criterion`	The criterion used to select the regularisation parameter `lambda.opt` (also `vn.opt` if needed). Options include `"GCV"`, `"BIC"`, `"AIC"`, or `"k-fold-CV"`. The default setting is `"GCV"`.
`penalty`	The penalty function applied in the penalised least-squares procedure. Currently, only `"grLasso"` and `"grSCAD"` are implemented. The default is `"grSCAD"`.
`max.iter`	Maximum number of iterations allowed across the entire path. The default value is 1000.
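
The rule that picks the default `lambda.min`, and the way `vn` splits the covariates into groups of consecutive variables, can be sketched in a few lines of base R. The helpers `default_lambda_min` and `consecutive_groups` are hypothetical names used only for illustration; they are not functions of fsemipar, and the grouping helper additionally assumes that the number of covariates is a multiple of `vn`.

```r
# Hypothetical helper (not part of fsemipar): default lower endpoint of the
# lambda sequence, following the rule stated for `lambda.min`.
default_lambda_min <- function(n, p, factor.pn = 1,
                               lambda.min.h = 0.05, lambda.min.l = 1e-4) {
  if (n > factor.pn * p) lambda.min.l else lambda.min.h
}

# Hypothetical helper (not part of fsemipar): group labels when p covariates
# are split into vn groups of consecutive variables.
# Assumption: p is a multiple of vn.
consecutive_groups <- function(p, vn) {
  stopifnot(p %% vn == 0)
  rep(seq_len(vn), each = p / vn)
}

default_lambda_min(n = 160, p = 7)    # n > p, so lambda.min.l: 1e-04
default_lambda_min(n = 50, p = 100)   # n <= p, so lambda.min.h: 0.05
consecutive_groups(p = 6, vn = 6)     # 1 2 3 4 5 6 (each covariate alone)
consecutive_groups(p = 6, vn = 3)     # 1 1 2 2 3 3 (pairs penalised together)
```

With `vn = ncol(z)` every covariate forms its own group, which is the default behaviour described above.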

Details

The sparse linear model (SLM) is given by the expression:

Y_i=Z_{i1}\beta_{01}+\dots+Z_{ip_n}\beta_{0p_n}+\varepsilon_i\ \ \ i=1,\dots,n,

where Y_i denotes a scalar response and Z_{i1},\dots,Z_{ip_n} are real covariates. In this equation, \mathbf{\beta}_0=(\beta_{01},\dots,\beta_{0p_n})^{\top} is a vector of unknown real parameters and \varepsilon_i represents the random error.

In this function, the SLM is fitted using a penalised least-squares (PeLS) approach by minimising

\mathcal{Q}\left(\mathbf{\beta}\right)=\frac{1}{2}\left(\mathbf{Y}-\mathbf{Z}\mathbf{\beta}\right)^{\top}\left(\mathbf{Y}-\mathbf{Z}\mathbf{\beta}\right)+n\sum_{j=1}^{p_n}\mathcal{P}_{\lambda_{j_n}}\left(|\beta_j|\right), \quad (1)

where \mathbf{\beta}=(\beta_1,\ldots,\beta_{p_n})^{\top}, \mathcal{P}_{\lambda_{j_n}}\left(\cdot\right) is a penalty function (specified in the argument penalty) and \lambda_{j_n} > 0 is a tuning parameter. To reduce the number of tuning parameters \lambda_{j_n} to be selected for each sample, we consider \lambda_{j_n} = \lambda \widehat{\sigma}_{\beta_{0,j,OLS}}, where \beta_{0,j,OLS} denotes the OLS estimate of \beta_{0j} and \widehat{\sigma}_{\beta_{0,j,OLS}} is its estimated standard deviation. The parameter \lambda is selected using the objective criterion specified in the argument criterion.
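
As an illustration of expression (1), the following base R sketch evaluates the PeLS objective with the SCAD penalty of Fan and Li (2001) and covariate-specific tuning parameters built from OLS standard errors. It is a minimal sketch on simulated data, not the package's internal code: the names `scad_penalty` and `pels_objective` are hypothetical, the constant a = 3.7 is the value suggested by Fan and Li, and the OLS fit omits the intercept to match the SLM expression.

```r
set.seed(1)
n <- 100; p <- 5
Z <- matrix(rnorm(n * p), n, p)
beta0 <- c(2, -1, 0, 0, 0)            # only the first two covariates matter
y <- drop(Z %*% beta0 + rnorm(n))

# SCAD penalty (Fan and Li, 2001); hypothetical helper, vectorised in t and lambda.
scad_penalty <- function(t, lambda, a = 3.7) {
  t <- abs(t)
  ifelse(t <= lambda,
         lambda * t,
         ifelse(t <= a * lambda,
                (2 * a * lambda * t - t^2 - lambda^2) / (2 * (a - 1)),
                lambda^2 * (a + 1) / 2))
}

# OLS fit without intercept and the estimated standard deviations of its coefficients.
ols <- lm(y ~ Z - 1)
se_ols <- summary(ols)$coefficients[, "Std. Error"]

# One common lambda rescaled by each standard error gives the per-covariate parameters.
lambda <- 0.1
lambda_j <- lambda * se_ols

# Objective (1): half the residual sum of squares plus n times the summed penalties.
pels_objective <- function(beta, Z, y, lambda_j) {
  r <- y - drop(Z %*% beta)
  0.5 * sum(r^2) + length(y) * sum(scad_penalty(beta, lambda_j))
}

pels_objective(coef(ols), Z, y, lambda_j)   # objective at the OLS estimate
pels_objective(beta0, Z, y, lambda_j)       # objective at the sparse true vector
```

Only the single value `lambda` has to be tuned in this setup; the standard errors fix the relative size of the penalties across covariates.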

For further details on the estimation procedure of the SLM, see e.g. Fan and Li (2001). The PeLS objective function is minimised using the R function grpreg of the package grpreg (Breheny and Huang, 2015).
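
For readers unfamiliar with grpreg, the sketch below shows a direct call to `grpreg()` on simulated data, followed by the selection of lambda with `grpreg::select()`. It only illustrates the underlying solver under simple assumptions (each covariate in its own group, BIC as criterion); it does not reproduce how `lm.pels.fit` prepares the data or rescales the penalties.

```r
library(grpreg)

set.seed(1)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:2] %*% c(2, -1) + rnorm(n))   # covariates 3 to 8 are irrelevant

# Regularisation path with the group SCAD penalty; one group per covariate.
path <- grpreg(X, y, group = 1:p, penalty = "grSCAD",
               nlambda = 100, lambda.min = 1e-4)

# Pick the lambda on the path that minimises BIC.
best <- grpreg::select(path, criterion = "BIC")

best$lambda                  # selected value of lambda
which(best$beta[-1] != 0)    # selected covariates (first entry of beta is the intercept)
```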

Remark: If lambda.seq is set to 0, the function returns the non-penalised estimate of the model, i.e. the OLS estimate. Using lambda.seq with a value \neq 0 is advisable when the presence of irrelevant variables is suspected.
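
The remark can be checked numerically: with \lambda = 0 the penalty term in (1) vanishes, so minimising \mathcal{Q} reduces to ordinary least squares. The base R sketch below, on simulated data and without an intercept (an assumption made to match the SLM expression), compares the normal-equations solution with `lm()` and shows that OLS assigns non-zero estimates even to irrelevant covariates.

```r
set.seed(2)
n <- 80
Z <- matrix(rnorm(n * 4), n, 4)
y <- drop(1.5 * Z[, 1] + rnorm(n))    # covariates 2 to 4 are irrelevant

# Minimiser of (1) when lambda = 0: solution of the normal equations.
beta_ols <- drop(solve(crossprod(Z), crossprod(Z, y)))

# The same estimate obtained with lm() (no intercept).
beta_lm <- unname(coef(lm(y ~ Z - 1)))

all.equal(beta_ols, beta_lm)   # TRUE: both are the OLS estimate
round(beta_ols, 3)             # irrelevant covariates are not set exactly to zero
```

A positive lambda is what allows the penalised fit to set such coefficients exactly to zero.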

Value

`call`	The matched call.
`fitted.values`	Estimated scalar response.
`residuals`	Differences between `y` and the `fitted.values`.
`beta.est`	Estimate of `\beta_0` when the optimal penalisation parameter `lambda.opt` and `vn.opt` are used.
`indexes.beta.nonnull`	Indexes of the non-zero `\hat{\beta}_{j}`.
`lambda.opt`	Selected value of lambda.
`IC`	Value of the criterion function considered to select `lambda.opt` and `vn.opt`.
`vn.opt`	Selected value of `vn`.
`...`

Author(s)

German Aneiros Perez german.aneiros@udc.es

Silvia Novo Diaz snovo@est-econ.uc3m.es

References

Breheny, P., and Huang, J. (2015) Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25, 173–187, doi:10.1007/s11222-013-9424-2.

Fan, J., and Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360, doi:10.1198/016214501753382273.

Examples

library(fsemipar)

data("Tecator")
y <- Tecator$fat
z1 <- Tecator$protein
z2 <- Tecator$moisture

# Quadratic, cubic and interaction effects of the scalar covariates.
z.com <- cbind(z1, z2, z1^2, z2^2, z1^3, z2^3, z1 * z2)
train <- 1:160

# LM fit on the training sample (timed with proc.time)
ptm <- proc.time()
fit <- lm.pels.fit(z = z.com[train, ], y = y[train], lambda.min.h = 0.02,
                   lambda.min.l = 0.01, factor.pn = 2, max.iter = 5000,
                   criterion = "BIC")
proc.time() - ptm

# Results
fit
names(fit)

[Package fsemipar version 1.1.1 Index]