R: Cross-Validation Method for dgLARS

cvdglars {dglars}

R Documentation

Cross-Validation Method for dgLARS

Description

Uses the k-fold cross-validation deviance to estimate the solution point of the dgLARS solution curve.

Usage

cvdglars(formula, family = gaussian, g, unpenalized, 
b_wght, data, subset, contrasts = NULL, control = list())

cvdglars.fit(X, y, family = gaussian, g, unpenalized,
b_wght, control = list())

Arguments

`formula`	an object of class “`formula`”: a symbolic description of the model to be fitted. When the `binomial` family is used, the response can be a vector with entries 0/1 (failure/success) or, alternatively, a matrix where the first column is the number of “successes” and the second column is the number of “failures”.
`family`	a description of the error distribution and link function used to specify the model. This can be a character string naming a family function or the result of a call to a family function (see `family` for details). By default the gaussian family with identity link function is used.
`g`	argument available only for the `ccd` algorithm. When the `ccd` algorithm is used to fit the dgLARS model, this argument can be used to specify the values of the tuning parameter.
`unpenalized`	a vector used to specify the unpenalized estimators; `unpenalized` can be a vector of integers or characters specifying the names of the predictors with unpenalized estimators.
`b_wght`	a vector, with length equal to the number of columns of the matrix `X`, used to compute the weights used in the adaptive dgLARS method. `b_wght` is used to specify the initial estimates of the parameter vector.
`data`	an optional data frame, list or environment (or object coercible by ‘as.data.frame’ to a data frame) containing the variables in the model. If not found in ‘data’, the variables are taken from ‘environment(formula)’.
`subset`	an optional vector specifying a subset of observations to be used in the fitting process.
`contrasts`	an optional list. See the ‘contrasts.arg’ of ‘model.matrix.default’.
`control`	a list of control parameters. See ‘Details’.
`X`	design matrix of dimension n × p.
`y`	response vector. When the `binomial` family is used, this argument can be a vector with entries 0 (failure) or 1 (success). Alternatively, the response can be a matrix where the first column is the number of “successes” and the second column is the number of “failures”.

Details

The cvdglars function runs dglars nfold+1 times. The deviance of each run is stored, and its average and standard deviation over the folds are computed.
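The fold-wise averaging described above can be sketched in base R. This is only an illustration of k-fold cross-validation deviance, not the package's internal code: `glm()` stands in for a single point of the dgLARS solution curve, and the simulated data and the names `d`, `held_out`, `dev_fold`, `dev_mean` and `dev_sd` are assumptions made for the example.

```r
# Illustration only: glm() replaces one fixed point of the dgLARS path.
set.seed(1)
n <- 60
nfold <- 5
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 + d$x))

# Balanced random assignment of the observations to the folds.
foldid <- sample(rep(seq_len(nfold), length.out = n))

dev_fold <- numeric(nfold)
for (k in seq_len(nfold)) {
  held_out <- foldid == k
  # Fit on the other folds, predict on the held-out fold.
  fit <- glm(y ~ x, family = binomial, data = d[!held_out, ])
  mu <- predict(fit, newdata = d[held_out, ], type = "response")
  # Binomial deviance of the held-out observations (unit weights).
  dev_fold[k] <- sum(binomial()$dev.resids(d$y[held_out], mu,
                                           rep(1, sum(held_out))))
}

dev_mean <- mean(dev_fold)  # average deviance over the folds
dev_sd <- sd(dev_fold)      # its standard deviation over the folds
```

In cvdglars the same kind of summary is computed for every value of the tuning parameter, which is why the stored deviance vectors have length ng.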

cvdglars.fit is the workhorse function: it is more efficient when the design matrix has already been calculated. For this reason we suggest using this function when the dgLARS method is applied in a high-dimensional setting, i.e. when p>n.

The control argument is a list that can supply any of the following components:

algorithm:

a string specifying the algorithm used to compute the solution curve. The predictor-corrector algorithm is used when algorithm = "pc" (default), while the cyclic coordinate descent method is used when algorithm = "ccd";

method:

a string used to specify the kind of solution curve. If method = "dgLASSO" (default) the algorithm computes the solution curve defined by the differential geometric generalization of the LASSO estimator; otherwise, if method = "dgLARS", the differential geometric generalization of the least angle regression method is used;

nfold:

a non-negative integer used to specify the number of folds. Although nfold can be as large as the sample size (leave-one-out CV), this is not recommended for large datasets. Default is nfold = 10;

foldid:

an n-dimensional vector of integers, between 1 and n, used to define the folds for the cross-validation. By default foldid is randomly generated;

ng:

number of values of the tuning parameter used to compute the cross-validation deviance. Default is ng = 100;

nv:

control parameter for the pc algorithm. An integer value belonging to the interval [1, min(n, p)] (default is nv = min(n-1, p)) used to specify the maximum number of variables included in the final model;

np:

control parameter for the pc/ccd algorithm. A non-negative integer used to define the maximum number of points of the solution curve. For the predictor-corrector algorithm np is set to 50 * min(n-1, p) (default), while for the cyclic coordinate descent method it is set to 100 (default), i.e. the number of values of the tuning parameter γ;

g0:

control parameter for the pc/ccd algorithm. Sets the smallest value for the tuning parameter γ. Default is g0 = ifelse(p<n, 1.0e-06, 0.05);

dg_max:

control parameter for the pc algorithm. A non-negative value used to specify the maximum length of the step size. With dg_max = 0 (default), the predictor-corrector algorithm uses the optimal step size (see Augugliaro et al. (2013) for more details) to approximate the value of the tuning parameter corresponding to the inclusion/exclusion of a variable from the model;

nNR:

control parameter for the pc algorithm. A non-negative integer used to specify the maximum number of iterations of the Newton-Raphson algorithm used in the corrector step. Default is nNR = 200;

NReps:

control parameter for the pc algorithm. A non-negative value used to define the convergence criterion of the Newton-Raphson algorithm. Default is NReps = 1.0e-06;

ncrct:

control parameter for the pc algorithm. When the Newton-Raphson algorithm does not converge, the step size (dγ) is reduced by setting dγ = cf * dγ and the corrector step is repeated. ncrct is a non-negative integer used to specify the maximum number of trials for the corrector step. Default is ncrct = 50;

cf:

control parameter for the pc algorithm. The contraction factor is a real value belonging to the interval [0, 1] used to reduce the step size as previously described. Default is cf = 0.5;

nccd:

control parameter for the ccd algorithm. A non-negative integer used to specify the maximum number of steps of the cyclic coordinate descent algorithm. Default is nccd = 1.0e+05;

eps:

control parameter for the pc/ccd algorithm. The meaning of this parameter depends on the algorithm used to estimate the solution curve:

i. if algorithm = "pc", then eps is used

a. to identify a variable that will be included in the active set (the absolute value of the corresponding Rao's score test statistic belongs to [γ - eps, γ + eps]);
b. to establish whether the corrector step must be repeated;
c. to define the convergence of the algorithm, i.e., the current value of the tuning parameter belongs to the interval [g0 - eps, g0 + eps];

ii. if algorithm = "ccd", then eps is used to define the convergence for a single solution point, i.e., each inner coordinate-descent loop continues until the maximum change in the Rao's score test statistic, after any coefficient update, is less than eps.

Default is eps = 1.0e-05.
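As a small worked example of the components listed above, the following base-R sketch evaluates the documented defaults for one choice of n and p and then assembles a control list. The particular values chosen for nfold, ng and the seed are assumptions made for illustration, as are the variable names `g0_default`, `nv_default`, `np_default` and `ctrl`; only the component names come from this page.

```r
n <- 100
p <- 100

# Defaults as documented above, evaluated for this n and p.
g0_default <- ifelse(p < n, 1.0e-06, 0.05)  # p < n is FALSE here, so 0.05
nv_default <- min(n - 1, p)                  # 99
np_default <- 50 * min(n - 1, p)             # 4950 (pc algorithm)

# Fixed fold labels, so that repeated runs use the same partition.
set.seed(42)
nfold <- 5
foldid <- sample(rep(seq_len(nfold), length.out = n))

# Components not listed here keep their default values.
ctrl <- list(algorithm = "pc", method = "dgLASSO",
             nfold = nfold, foldid = foldid, ng = 50, eps = 1.0e-05)
```

The list would then be supplied through the control argument, for example cvdglars.fit(X, y, family = binomial, control = ctrl). Fixing foldid in this way is useful when two fits, such as "dgLASSO" and "dgLARS", should be compared on the same folds.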

Value

cvdglars returns an object with S3 class “cvdglars”, i.e. a list containing the following components:

`call`	the call that produced this object;
`formula_cv`	if the model is fitted by `cvdglars`, the formula used is returned;
`family`	a description of the error distribution used in the model;
`var_cv`	a character vector with the names of the variables selected by cross-validation;
`beta`	the vector of the coefficients estimated by cross-validation;
`phi`	the cross-validation estimate of the dispersion parameter;
`dev_m`	a vector of length `ng` used to store the mean cross-validation deviance;
`dev_v`	a vector of length `ng` used to store the variance of the mean cross-validation deviance;
`g`	the value of the tuning parameter corresponding to the minimum of the cross-validation deviance;
`g0`	the smallest value for the tuning parameter;
`g_max`	the value of the tuning parameter corresponding to the starting point of the dgLARS solution curve;
`X`	the design matrix used;
`y`	the response vector used;
`w`	the vector of weights used to compute the adaptive dgLARS method;
`conv`	an integer value used to encode the warnings and the errors related to the algorithm used to fit the dgLARS solution curve. The values returned are: `0` convergence of the algorithm has been achieved; `1` problems related to the predictor-corrector method: error in the predictor step; `2` problems related to the predictor-corrector method: error in the corrector step; `3` maximum number of iterations has been reached; `4` error in dynamic memory allocation;
`control`	the list of control parameters used to compute the cross-validation deviance.
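The components listed above can be read directly from the returned list. The sketch below uses the formula interface on a small simulated data frame; the data, the column names `x1` to `x5` and the use of `y ~ .` are assumptions made for illustration, and the call is only attempted when the dglars package is installed.

```r
set.seed(123)
n <- 100
dat <- data.frame(matrix(rnorm(n * 5), n, 5))
names(dat) <- paste0("x", 1:5)
# Only x1 carries signal in this simulated logistic model.
dat$y <- rbinom(n, 1, binomial()$linkinv(1 + 2 * dat$x1))

if (requireNamespace("dglars", quietly = TRUE)) {
  fit_cv <- dglars::cvdglars(y ~ ., family = binomial, data = dat)
  print(fit_cv$formula_cv)     # formula used for the fit
  print(fit_cv$var_cv)         # names of the selected variables
  print(fit_cv$g)              # tuning parameter minimising the CV deviance
  print(fit_cv$beta)           # coefficients at that value
  print(length(fit_cv$dev_m))  # one mean CV deviance per tuning value (ng)
}
```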

Author(s)

Luigi Augugliaro
Maintainer: Luigi Augugliaro luigi.augugliaro@unipa.it

References

Augugliaro L., Mineo A.M. and Wit E.C. (2014) dglars: An R Package to Estimate Sparse Generalized Linear Models, Journal of Statistical Software, Vol 59(8), 1-40. doi:10.18637/jss.v059.i08. https://www.jstatsoft.org/v59/i08/.

Augugliaro L., Mineo A.M. and Wit E.C. (2013) dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society, Series B, Vol 75(3), 471-498. doi:10.1111/rssb.12000.

Examples

###########################
# Logistic regression model
# y ~ Binomial
set.seed(123)
n <- 100
p <- 100
X <- matrix(rnorm(n * p), n, p)
b <- 1:2
eta <- b[1] + X[, 1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
fit_cv <- cvdglars.fit(X, y, family = binomial)
fit_cv

[Package dglars version 2.1.7]