dglars {dglars}    R Documentation
dgLARS Solution Curve for GLM
Description
The dglars function is used to estimate the solution curve defined by the dgLARS method.
Usage
dglars(formula, family = gaussian, g, unpenalized,
b_wght, data, subset, contrasts = NULL, control = list())
dglars.fit(X, y, family = gaussian, g, unpenalized,
b_wght, control = list())
Arguments
formula: an object of class ‘formula’: a symbolic description of the model to be fitted.

family: a description of the error distribution and link function used to specify the model. This can be a character string naming a family function or the result of a call to a family function (see ‘family’ for details).

g: argument available only for the ccd algorithm; an optional sequence of values of the tuning parameter \gamma at which the solution curve is evaluated.

unpenalized: a vector used to specify the unpenalized estimators; it can be a vector of integers or names identifying the predictors whose coefficients are not penalized.

b_wght: a p + 1 dimensional vector used to specify the weights of the adaptive dgLARS method.

data: an optional data frame, list or environment (or object coercible by ‘as.data.frame’ to a data frame) containing the variables in the model. If not found in ‘data’, the variables are taken from ‘environment(formula)’.

subset: an optional vector specifying a subset of observations to be used in the fitting process.

contrasts: an optional list. See the ‘contrasts.arg’ of ‘model.matrix.default’.

control: a list of control parameters. See ‘Details’.

X: design matrix of dimension n x p.

y: response vector. When the binomial family is used, this argument can be a vector with entries 0/1 (failure/success) or, alternatively, a matrix where the first column is the number of “successes” and the second column is the number of “failures”.
Details
The dglars function implements the differential geometric generalization of the least angle regression method (Efron et al., 2004) proposed in Augugliaro et al. (2013) and Pazira et al. (2018). As in ‘glm’, the user can specify the family and link function using the argument family. When the binomial family is used, the response can be a vector with entries 0/1 (failure/success) or, alternatively, a matrix where the first column is the number of “successes” and the second column is the number of “failures”. Starting with version 2.0.0, the model can be specified by combining family and link functions as described in the following table:
Family              Link
gaussian            ‘identity’, ‘log’, ‘inverse’
binomial            ‘logit’, ‘probit’, ‘cauchit’, ‘log’, ‘cloglog’
poisson             ‘log’, ‘identity’, ‘sqrt’
Gamma               ‘inverse’, ‘identity’, ‘log’
inverse.gaussian    ‘1/mu^2’, ‘inverse’, ‘identity’, ‘log’
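For instance, a binomial model can be fitted with a 0/1 response or with a two-column matrix of successes and failures, combining family and link as in the table above. The following is a minimal sketch on simulated data (all names and values below are illustrative only):

set.seed(321)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
# 0/1 response with the default logit link
y01 <- rbinom(n, 1, binomial()$linkinv(X[, 1]))
fit1 <- dglars(y01 ~ X, family = binomial)
# two-column (successes/failures) response with the probit link
size <- 5
succ <- rbinom(n, size, binomial("probit")$linkinv(X[, 1]))
ymat <- cbind(succ, size - succ)
fit2 <- dglars(ymat ~ X, family = binomial("probit"))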
The R code for the binomial, Gamma and inverse Gaussian families is due to Hassan Pazira, while the Fortran version is due to Luigi Augugliaro.
dglars.fit is a workhorse function: it is more efficient when the design matrix does not require manipulations. For this reason, we suggest using this function when the dgLARS method is applied in a high-dimensional setting, i.e., when p > n.
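A minimal sketch (simulated data) of a direct call to the workhorse function in a p > n setting:

set.seed(321)
n <- 50
p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(1 + 2 * X[, 1]))
# bypass the formula interface and work on the numeric design matrix
fit <- dglars.fit(X, y, family = binomial)
fit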
When the gaussian, Gamma or inverse.gaussian family is used to model the error distribution, dglars returns the vector of estimates of the dispersion parameter \phi; by default, the generalized Pearson statistic is used as estimator, but the user can use the function phihat to specify other estimators (see phihat for more details).
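A sketch of how the dispersion estimates might be inspected; the commented phihat() call assumes the function accepts the fitted object together with a ‘type’ argument naming the estimator, which should be checked against the phihat help page:

set.seed(321)
n <- 100; p <- 10
X <- matrix(abs(rnorm(n * p)), n, p)
y <- rgamma(n, shape = 0.5, scale = 2 * exp(1 + X[, 1]))
fit <- dglars(y ~ X, family = Gamma("log"))
fit$phi                            # generalized Pearson estimates (default)
# phihat(fit, type = "deviance")   # assumed alternative estimator; see ?phihat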
The dgLARS solution curve can be estimated using two different algorithms: the predictor-corrector method and the cyclic coordinate descent method (see below for more details about the argument algorithm). The first algorithm is based on two steps. In the first step, called the predictor step, an approximation of the point that lies on the solution curve is computed. If the control parameter dg_max is equal to zero, an approximation of the optimal step size is also computed in this step, using a generalization of the method proposed in Efron et al. (2004). The optimal step size is defined as the reduction of the tuning parameter, denoted by d\gamma, such that at \gamma - d\gamma there is a change in the active set. In the second step, called the corrector step, a Newton-Raphson algorithm is used to correct the approximation previously computed. The main drawback of this algorithm is that the number of arithmetic operations required to compute the approximation scales as the cube of the number of variables, which makes it cumbersome in a high-dimensional setting. To overcome this problem, the second algorithm computes the dgLARS solution curve using an adaptive version of the cyclic coordinate descent method proposed in Friedman et al. (2010).
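A sketch of how the two algorithms can be selected via the ‘control’ argument (simulated Poisson data, for illustration only):

set.seed(321)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rpois(n, poisson()$linkinv(1 + X[, 1]))
# predictor-corrector method (default)
fit_pc  <- dglars(y ~ X, family = poisson, control = list(algorithm = "pc"))
# cyclic coordinate descent method
fit_ccd <- dglars(y ~ X, family = poisson, control = list(algorithm = "ccd"))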
The argument control is a list that can supply any of the following components:

algorithm: a string specifying the algorithm used to compute the solution curve. The predictor-corrector algorithm is used when algorithm = "pc" (default), while the cyclic coordinate descent method is used when algorithm = "ccd";

method: a string used to specify the kind of solution curve. If method = "dgLASSO" (default), the algorithm computes the solution curve defined by the differential geometric generalization of the LASSO estimator; otherwise (method = "dgLARS") the differential geometric generalization of the least angle regression method is used;

nv: control parameter for the pc algorithm. An integer value between 1 and min(n, p) used to specify the maximum number of variables in the final model. Default is nv = min(n - 1, p);

np: control parameter for the pc/ccd algorithm. A non-negative integer used to define the maximum number of solution points. For the predictor-corrector algorithm np is set to 50 * min(n - 1, p) (default); for the cyclic coordinate descent method, if g is not specified, this argument is set equal to 100 (default);

g0: control parameter for the pc/ccd algorithm. This parameter is used to set the smallest value of the tuning parameter \gamma. Default is g0 = ifelse(p < n, 1.0e-04, 0.05); this argument is not required when g is used with the cyclic coordinate descent algorithm;

dg_max: control parameter for the pc algorithm. A non-negative value used to specify the largest value of the step size. Setting dg_max = 0 (default), the predictor-corrector algorithm computes an approximation of the optimal step size (see Augugliaro et al. (2013) for more details);

nNR: control parameter for the pc algorithm. A non-negative integer used to specify the maximum number of iterations of the Newton-Raphson algorithm. Default is nNR = 50;

NReps: control parameter for the pc algorithm. A non-negative value used to define the convergence of the Newton-Raphson algorithm. Default is NReps = 1.0e-06;

ncrct: control parameter for the pc algorithm. When the Newton-Raphson algorithm does not converge, the step size (d\gamma) is reduced by d\gamma = cf * d\gamma and the corrector step is repeated. ncrct is a non-negative integer used to specify the maximum number of trials for the corrector step. Default is ncrct = 50;

cf: control parameter for the pc algorithm. The contraction factor is a real value in the interval [0, 1] used to reduce the step size as previously described. Default is cf = 0.5;

nccd: control parameter for the ccd algorithm. A non-negative integer used to specify the maximum number of steps of the cyclic coordinate descent algorithm. Default is 1.0e+05;

eps: control parameter for the pc/ccd algorithm. The meaning of this parameter depends on the algorithm used to estimate the solution curve:

i. if algorithm = "pc", then eps is used
   a. to identify a variable that will be included in the active set (the absolute value of the corresponding Rao's score test statistic belongs to [\gamma - eps, \gamma + eps]);
   b. to establish whether the corrector step must be repeated;
   c. to define the convergence of the algorithm, i.e., the actual value of the tuning parameter belongs to the interval [g0 - eps, g0 + eps];

ii. if algorithm = "ccd", then eps is used to define the convergence of a single solution point, i.e., each inner coordinate-descent loop continues until the maximum change in the Rao's score test statistic, after any coefficient update, is less than eps.

Default is eps = 1.0e-05.
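As an illustration (the values below are arbitrary, not recommendations), several control parameters can be supplied at once:

set.seed(321)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(X[, 1]))
ctrl <- list(algorithm = "pc", method = "dgLARS", nv = 5,
             g0 = 0.05, eps = 1.0e-06)
fit <- dglars(y ~ X, family = binomial, control = ctrl)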
Value
dglars returns an object with S3 class “dglars”, i.e., a list containing the following components:

call: the call that produced this object;

formula: if the model is fitted by dglars, the formula used;

family: a description of the error distribution used in the model;

unpenalized: the vector used to specify the unpenalized estimators;

np: the number of points of the dgLARS solution curve;

beta: the matrix of estimated coefficients, one column for each point of the solution curve;

phi: the vector of estimates of the dispersion parameter, one for each point of the solution curve;

ru: the matrix of the Rao's score test statistics of the variables included in the final model. This component is reported only if the predictor-corrector algorithm is used to fit the model;

dev: the vector of deviances, one for each point of the solution curve;

nnonzero: the sequence of numbers of nonzero coefficients for each value of the tuning parameter;

g: the sequence of \gamma values used to compute the solution curve;

X: the used design matrix;

y: the used response vector;

w: the vector of weights used to compute the adaptive dgLARS method;

action: a vector of characters describing how the active set changes along the solution curve;

conv: an integer value used to encode the warnings and errors related to the algorithm used to fit the model;

control: the list of control parameters used to compute the dgLARS solution curve.
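A sketch of how some of the components listed above might be inspected on a fitted object (simulated data for illustration):

set.seed(321)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(X[, 1]))
fit <- dglars(y ~ X, family = binomial)
fit$np          # number of points of the solution curve
head(fit$g)     # first values of the tuning parameter
fit$nnonzero    # number of nonzero coefficients at each point
fit$conv        # convergence/warning code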
Author(s)
Luigi Augugliaro and Hassan Pazira
Maintainer: Luigi Augugliaro luigi.augugliaro@unipa.it
References
Augugliaro L., Mineo A.M. and Wit E.C. (2016) <doi:10.1093/biomet/asw023> A differential-geometric approach to generalized linear models with grouped predictors, Biometrika, Vol 103(3), 563-577.
Augugliaro L., Mineo A.M. and Wit E.C. (2014) <doi:10.18637/jss.v059.i08> dglars: An R Package to Estimate Sparse Generalized Linear Models, Journal of Statistical Software, Vol 59(8), 1-40. https://www.jstatsoft.org/v59/i08/.
Augugliaro L., Mineo A.M. and Wit E.C. (2013) <doi:10.1111/rssb.12000> dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society. Series B., Vol 75(3), 471-498.
Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004) <doi:10.1214/009053604000000067> Least Angle Regression, The Annals of Statistics, Vol. 32(2), 407-499.
Friedman J., Hastie T. and Tibshirani R. (2010) <doi:10.18637/jss.v033.i01> Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, Vol. 33(1), 1-22.
Pazira H., Augugliaro L. and Wit E.C. (2018) <doi:10.1007/s11222-017-9761-7> Extended differential geometric LARS for high-dimensional GLMs with general dispersion parameter, Statistics and Computing, Vol. 28(4), 753-774.
See Also
coef.dglars, phihat, plot.dglars, print.dglars and summary.dglars methods.
Examples
set.seed(123)
#############################
# y ~ Binomial
n <- 100
p <- 100
X <- matrix(rnorm(n * p), n, p)
eta <- 1 + 2 * X[,1]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
fit <- dglars(y ~ X, family = binomial)
fit
# adaptive dglars method
b_wght <- coef(fit)$beta[, 20]
fit <- dglars(y ~ X, family = binomial, b_wght = b_wght)
fit
# the first three coefficients are not penalized
fit <- dglars(y ~ X, family = binomial, unpenalized = 1:3)
fit
# 'probit' link function
fit <- dglars(y ~ X, family = binomial("probit"))
fit
############################
# y ~ Poisson
n <- 100
p <- 100
X <- matrix(rnorm(n * p), n, p)
eta <- 2 + 2 * X[,1]
mu <- poisson()$linkinv(eta)
y <- rpois(n, mu)
fit <- dglars(y ~ X, family = poisson)
fit
############################
# y ~ Gamma
n <- 100
p <- 100
X <- matrix(abs(rnorm(n * p)), n, p)
eta <- 1 + 2 * X[,1]
mu <- drop(Gamma()$linkinv(eta))
shape <- 0.5
phi <- 1 / shape   # dispersion parameter
y <- rgamma(n, scale = mu / shape, shape = shape)
fit <- dglars(y ~ X, Gamma("log"))
fit