lmSP_cv {fdaSP}R Documentation

Cross-validation for Sparse Adaptive Overlap Group Least Absolute Shrinkage and Selection Operator

Description

Sparse Adaptive overlap group-LASSO, or sparse adaptive group L_2-regularized regression, solves the following optimization problem

\textrm{min}_{\beta,\gamma} ~ \frac{1}{2}\|y-X\beta-Z\gamma\|_2^2 + \lambda\Big[(1-\alpha) \sum_{g=1}^G \|S_g T\beta\|_2+\alpha\Vert T_1\beta\Vert_1\Big]

to obtain a sparse coefficient vector \beta\in\mathbb{R}^p for the matrix of penalized predictors X and a coefficient vector \gamma\in\mathbb{R}^q for the matrix of unpenalized predictors Z. For each group g, each row of the matrix S_g\in\mathbb{R}^{n_g\times p} has non-zero entries only for those variables belonging to that group. These values are provided by the arguments groups and group_weights (see below). Each variable can belong to more than one group. The diagonal matrix T\in\mathbb{R}^{p\times p} contains the variable-specific weights. These values are provided by the argument var_weights (see below). The diagonal matrix T_1\in\mathbb{R}^{p\times p} contains the variable-specific L_1 weights. These values are provided by the argument var_weights_L1 (see below). The regularization path is computed for the sparse adaptive overlap group-LASSO penalty at a grid of values for the regularization parameter \lambda using the alternating direction method of multipliers (ADMM). See Boyd et al. (2011) and Lin et al. (2022) for details on the ADMM method. The regularization is a combination of L_2 and L_1 simultaneous constraints. Different specifications of the penalty argument lead to different models choice:

LASSO

The classical Lasso regularization (Tibshirani, 1996) can be obtained by specifying \alpha = 1 and the matrix T_1 as the p \times p identity matrix. An adaptive version of this model (Zou, 2006) can be obtained if T_1 is a p \times p diagonal matrix of adaptive weights. See also Hastie et al. (2015) for further details.

GLASSO

The group-Lasso regularization (Yuan and Lin, 2006) can be obtained by specifying \alpha = 0, non-overlapping groups in S_g and by setting the matrix T equal to the p \times p identity matrix. An adaptive version of this model can be obtained if the matrix T is a p \times p diagonal matrix of adaptive weights. See also Hastie et al. (2015) for further details.

spGLASSO

The sparse group-Lasso regularization (Simon et al., 2011) can be obtained by specifying \alpha\in(0,1), non-overlapping groups in S_g and by setting the matrices T and T_1 equal to the p \times p identity matrix. An adaptive version of this model can be obtained if the matrices T and T_1 are p \times p diagonal matrices of adaptive weights.

OVGLASSO

The overlap group-Lasso regularization (Jenatton et al., 2011) can be obtained by specifying \alpha = 0, overlapping groups in S_g and by setting the matrix T equal to the p \times p identity matrix. An adaptive version of this model can be obtained if the matrix T is a p \times p diagonal matrix of adaptive weights.

spOVGLASSO

The sparse overlap group-Lasso regularization (Jenatton et al., 2011) can be obtained by specifying \alpha\in(0,1), overlapping groups in S_g and by setting the matrices T and T_1 equal to the p \times p identity matrix. An adaptive version of this model can be obtained if the matrices T and T_1 are p \times p diagonal matrices of adaptive weights.

Usage

lmSP_cv(
  X,
  Z = NULL,
  y,
  penalty = c("LASSO", "GLASSO", "spGLASSO", "OVGLASSO", "spOVGLASSO"),
  groups,
  group_weights = NULL,
  var_weights = NULL,
  var_weights_L1 = NULL,
  cv.fold = 5,
  standardize.data = TRUE,
  intercept = FALSE,
  overall.group = FALSE,
  lambda = NULL,
  alpha = NULL,
  lambda.min.ratio = NULL,
  nlambda = 30,
  control = list()
)

Arguments

X

an (n\times p) matrix of penalized predictors.

Z

an (n\times q) full column rank matrix of predictors that are not penalized.

y

a length-n response vector.

penalty

choose one from the following options: 'LASSO', for the or adaptive-Lasso penalties, 'GLASSO', for the group-Lasso penalty, 'spGLASSO', for the sparse group-Lasso penalty, 'OVGLASSO', for the overlap group-Lasso penalty and 'spOVGLASSO', for the sparse overlap group-Lasso penalty.

groups

either a vector of length p of consecutive integers describing the grouping of the coefficients, or a list with two elements: the first element is a vector of length \sum_{g=1}^G n_g containing the variables belonging to each group, where n_g is the cardinality of the g-th group, while the second element is a vector of length G containing the group lengths (see example below).

group_weights

a vector of length G containing group-specific weights. The default is square root of the group cardinality, see Yuan and Lin (2006).

var_weights

a vector of length p containing variable-specific weights. The default is a vector of ones.

var_weights_L1

a vector of length p containing variable-specific weights for the L_1 penalty. The default is a vector of ones.

cv.fold

the number of folds - default is 5.

standardize.data

logical. Should data be standardized?

intercept

logical. If it is TRUE, a column of ones is added to the design matrix.

overall.group

logical. This setting is only available for the overlap group-LASSO and the sparse overlap group-LASSO penalties, otherwise it is set to NULL. If it is TRUE, an overall group including all penalized covariates is added.

lambda

either a regularization parameter or a vector of regularization parameters. In this latter case the routine computes the whole path. If it is NULL values for lambda are provided by the routine.

alpha

the sparse overlap group-LASSO mixing parameter, with 0\leq\alpha\leq1. This setting is only available for the sparse group-LASSO and the sparse overlap group-LASSO penalties, otherwise it is set to NULL. The LASSO and group-LASSO penalties are obtained by specifying \alpha = 1 and \alpha = 0, respectively.

lambda.min.ratio

smallest value for lambda, as a fraction of the maximum lambda value. If n>p, the default is 0.0001, and if n<p, the default is 0.01.

nlambda

the number of lambda values - default is 30.

control

a list of control parameters for the ADMM algorithm. See ‘Details’.

Value

A named list containing

sp.coefficients

a length-p solution vector for the parameters \beta. If n_\lambda>1 then the provided vector corresponds to the minimum cross-validated MSE.

coefficients

a length-q solution vector for the parameters \gamma. If n_\lambda>1 then the provided vector corresponds to the minimum cross-validated MSE. It is provided only when either the matrix Z in input is not NULL or the intercept is set to TRUE.

sp.coef.path

an (n_\lambda\times p) matrix of estimated \beta coefficients for each lambda of the provided sequence.

coef.path

an (n_\lambda\times q) matrix of estimated \gamma coefficients for each lambda of the provided sequence. It is provided only when either the matrix Z in input is not NULL or the intercept is set to TRUE.

lambda

sequence of lambda.

lambda.min

value of lambda that attains the minimum cross-validated MSE.

mse

cross-validated mean squared error.

min.mse

minimum value of the cross-validated MSE for the sequence of lambda.

convergence

logical. 1 denotes achieved convergence.

elapsedTime

elapsed time in seconds.

iternum

number of iterations.

When you run the algorithm, output returns not only the solution, but also the iteration history recording following fields over iterates:

objval

objective function value

r_norm

norm of primal residual

s_norm

norm of dual residual

eps_pri

feasibility tolerance for primal feasibility condition

eps_dual

feasibility tolerance for dual feasibility condition.

Iteration stops when both r_norm and s_norm values become smaller than eps_pri and eps_dual, respectively.

Details

The control argument is a list that can supply any of the following components:

adaptation

logical. If it is TRUE, ADMM with adaptation is performed. The default value is TRUE. See Boyd et al. (2011) for details.

rho

an augmented Lagrangian parameter. The default value is 1.

tau.ada

an adaptation parameter greater than one. Only needed if adaptation = TRUE. The default value is 2. See Boyd et al. (2011) for details.

mu.ada

an adaptation parameter greater than one. Only needed if adaptation = TRUE. The default value is 10. See Boyd et al. (2011) for details.

abstol

absolute tolerance stopping criterion. The default value is sqrt(sqrt(.Machine$double.eps)).

reltol

relative tolerance stopping criterion. The default value is sqrt(.Machine$double.eps).

maxit

maximum number of iterations. The default value is 100.

print.out

logical. If it is TRUE, a message about the procedure is printed. The default value is TRUE.

References

Bernardi M, Canale A, Stefanucci M (2022). “Locally Sparse Function-on-Function Regression.” Journal of Computational and Graphical Statistics, 0(0), 1-15. doi:10.1080/10618600.2022.2130926, https://doi.org/10.1080/10618600.2022.2130926.

Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011). “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.” Foundations and Trends® in Machine Learning, 3(1), 1-122. ISSN 1935-8237, doi:10.1561/2200000016, http://dx.doi.org/10.1561/2200000016.

Hastie T, Tibshirani R, Wainwright M (2015). Statistical learning with sparsity: the lasso and generalizations, number 143 in Monographs on statistics and applied probability. CRC Press, Taylor & Francis Group, Boca Raton. ISBN 978-1-4987-1216-3.

Jenatton R, Audibert J, Bach F (2011). “Structured variable selection with sparsity-inducing norms.” J. Mach. Learn. Res., 12, 2777–2824. ISSN 1532-4435.

Lin Z, Li H, Fang C (2022). Alternating direction method of multipliers for machine learning. Springer, Singapore. ISBN 978-981-16-9839-2; 978-981-16-9840-8, doi:10.1007/978-981-16-9840-8, With forewords by Zongben Xu and Zhi-Quan Luo.

Simon N, Friedman J, Hastie T, Tibshirani R (2013). “A sparse-group lasso.” J. Comput. Graph. Statist., 22(2), 231–245. ISSN 1061-8600, doi:10.1080/10618600.2012.681250.

Yuan M, Lin Y (2006). “Model selection and estimation in regression with grouped variables.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.

Zou H (2006). “The adaptive lasso and its oracle properties.” J. Amer. Statist. Assoc., 101(476), 1418–1429. ISSN 0162-1459, doi:10.1198/016214506000000735.

Examples


### generate sample data
set.seed(2023)
n    <- 50
p    <- 30 
X    <- matrix(rnorm(n * p), n, p)

### Example 1, LASSO penalty

beta <- apply(matrix(rnorm(p, sd = 1), p, 1), 1, fdaSP::softhresh, 1.5)
y    <- X %*% beta + rnorm(n, sd = sqrt(crossprod(X %*% beta)) / 20)

### set the hyper-parameters of the ADMM algorithm
maxit      <- 1000
adaptation <- TRUE
rho        <- 1
reltol     <- 1e-5
abstol     <- 1e-5

### run cross-validation
mod_cv <- lmSP_cv(X = X, y = y, penalty = "LASSO", 
                  standardize.data = FALSE, intercept = FALSE,
                  cv.fold = 5, nlambda = 30, 
                  control = list("adaptation" = adaptation, 
                                 "rho" = rho, 
                                 "maxit" = maxit, "reltol" = reltol, 
                                 "abstol" = abstol, 
                                 "print.out" = FALSE)) 

### graphical presentation
plot(log(mod_cv$lambda), mod_cv$mse, type = "l", col = "blue", lwd = 2, bty = "n", 
     xlab = latex2exp::TeX("$\\log(\\lambda)$"), ylab = "Prediction Error", 
     ylim = range(mod_cv$mse - mod_cv$mse.sd, mod_cv$mse + mod_cv$mse.sd),
     main = "Cross-validated Prediction Error")
fdaSP::confband(xV = log(mod_cv$lambda), yVmin = mod_cv$mse - mod_cv$mse.sd, 
                yVmax = mod_cv$mse + mod_cv$mse.sd)       
abline(v = log(mod_cv$lambda[which(mod_cv$lambda == mod_cv$lambda.min)]), 
       col = "red", lwd = 1.0)

### comparison with oracle error
mod <- lmSP(X = X, y = y, penalty = "LASSO", 
            standardize.data = FALSE, 
            intercept = FALSE,
            nlambda = 30, 
            control = list("adaptation" = adaptation, 
                           "rho" = rho, 
                           "maxit" = maxit, "reltol" = reltol, 
                           "abstol" = abstol, 
                           "print.out" = FALSE)) 
                                         
err_mod <- apply(mod$sp.coef.path, 1, function(x) sum((x - beta)^2))
plot(log(mod$lambda), err_mod, type = "l", col = "blue", lwd = 2, 
     xlab = latex2exp::TeX("$\\log(\\lambda)$"), 
     ylab = "Estimation Error", main = "True Estimation Error", bty = "n")
abline(v = log(mod$lambda[which(err_mod == min(err_mod))]), col = "red", lwd = 1.0)
abline(v = log(mod_cv$lambda[which(mod_cv$lambda == mod_cv$lambda.min)]), 
       col = "red", lwd = 1.0, lty = 2)

### Example 2, sparse group-LASSO penalty

beta <- c(rep(4, 12), rep(0, p - 13), -2)
y    <- X %*% beta + rnorm(n, sd = sqrt(crossprod(X %*% beta)) / 20)

### define groups of dimension 3 each
group1 <- rep(1:10, each = 3)

### set regularization parameter grid
lam   <- 10^seq(1, -2, length.out = 30)

### set the alpha parameter 
alpha <- 0.5

### set the hyper-parameters of the ADMM algorithm
maxit         <- 1000
adaptation    <- TRUE
rho           <- 1
reltol        <- 1e-5
abstol        <- 1e-5

### run cross-validation
mod_cv <- lmSP_cv(X = X, y = y, penalty = "spGLASSO", 
                  groups = group1, cv.fold = 5, 
                  standardize.data = FALSE,  intercept = FALSE, 
                  lambda = lam, alpha = 0.5, 
                  control = list("adaptation" = adaptation, 
                                 "rho" = rho,
                                 "maxit" = maxit, "reltol" = reltol, 
                                 "abstol" = abstol, 
                                 "print.out" = FALSE)) 

### graphical presentation
plot(log(mod_cv$lambda), mod_cv$mse, type = "l", col = "blue", lwd = 2, bty = "n", 
     xlab = latex2exp::TeX("$\\log(\\lambda)$"), ylab = "Prediction Error", 
     ylim = range(mod_cv$mse - mod_cv$mse.sd, mod_cv$mse + mod_cv$mse.sd),
     main = "Cross-validated Prediction Error")
fdaSP::confband(xV = log(mod_cv$lambda), yVmin = mod_cv$mse - mod_cv$mse.sd, 
                yVmax = mod_cv$mse + mod_cv$mse.sd)       
abline(v = log(mod_cv$lambda[which(mod_cv$lambda == mod_cv$lambda.min)]), 
       col = "red", lwd = 1.0)

### comparison with oracle error
mod <- lmSP(X = X, y = y, 
            penalty = "spGLASSO", 
            groups = group1, 
            standardize.data = FALSE, 
            intercept = FALSE,
            lambda = lam, 
            alpha = 0.5, 
            control = list("adaptation" = adaptation, "rho" = rho, 
                           "maxit" = maxit, "reltol" = reltol, "abstol" = abstol, 
                           "print.out" = FALSE)) 
                                         
err_mod <- apply(mod$sp.coef.path, 1, function(x) sum((x - beta)^2))
plot(log(mod$lambda), err_mod, type = "l", col = "blue", lwd = 2, 
     xlab = latex2exp::TeX("$\\log(\\lambda)$"), 
     ylab = "Estimation Error", main = "True Estimation Error", bty = "n")
abline(v = log(mod$lambda[which(err_mod == min(err_mod))]), col = "red", lwd = 1.0)
abline(v = log(mod_cv$lambda[which(mod_cv$lambda == mod_cv$lambda.min)]), 
       col = "red", lwd = 1.0, lty = 2)


[Package fdaSP version 1.1.1 Index]