dglars {dglars}    R Documentation
dgLARS Solution Curve for GLM
Description
The dglars function is used to estimate the solution curve defined by the dgLARS method.
Usage
dglars(formula, family = gaussian, g, unpenalized,
b_wght, data, subset, contrasts = NULL, control = list())
dglars.fit(X, y, family = gaussian, g, unpenalized,
b_wght, control = list())
Arguments
formula: an object of class ‘formula’: a symbolic description of the model to be fitted.

family: a description of the error distribution and link function used to specify the model. This can be a character string naming a family function or the result of a call to a family function (see ‘family’ for details).

g: argument available only for the ccd algorithm; an optional sequence of values of the tuning parameter \gamma at which the solution curve is evaluated.

unpenalized: a vector used to specify the unpenalized estimators; it can be a vector of integers or names identifying the predictors whose coefficients are not penalized.

b_wght: a p + 1 dimensional vector used to specify the weights of the adaptive dgLARS method.

data: an optional data frame, list or environment (or object coercible by ‘as.data.frame’ to a data frame) containing the variables in the model. If not found in ‘data’, the variables are taken from ‘environment(formula)’.

subset: an optional vector specifying a subset of observations to be used in the fitting process.

contrasts: an optional list. See the ‘contrasts.arg’ of ‘model.matrix.default’.

control: a list of control parameters. See ‘Details’.

X: design matrix of dimension n x p.

y: response vector. When the binomial family is used, this argument can be a vector with entries 0/1 (failure/success) or, alternatively, a matrix where the first column is the number of “successes” and the second column is the number of “failures”.
Details
The dglars function implements the differential geometric generalization of the least angle regression method (Efron et al., 2004) proposed in Augugliaro et al. (2013) and Pazira et al. (2018). As in ‘glm’, the user can specify the family and link function using the argument family. When the binomial family is used, the response can be a vector with entries 0/1 (failure/success) or, alternatively, a matrix where the first column is the number of “successes” and the second column is the number of “failures”. Starting with version 2.0.0, the model can be specified by combining family and link functions as described in the following table:
Family              Link
gaussian            ‘identity’, ‘log’, ‘inverse’
binomial            ‘logit’, ‘probit’, ‘cauchit’, ‘log’, ‘cloglog’
poisson             ‘log’, ‘identity’, ‘sqrt’
Gamma               ‘inverse’, ‘identity’, ‘log’
inverse.gaussian    ‘1/mu^2’, ‘inverse’, ‘identity’, ‘log’
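For instance, a binomial model can be fitted with a 0/1 response or with a two-column matrix of successes and failures, combining family and link as in the table above. The following is a minimal sketch on simulated data (all names and values below are illustrative only):

set.seed(321)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
# 0/1 response with the default logit link
y01 <- rbinom(n, 1, binomial()$linkinv(X[, 1]))
fit1 <- dglars(y01 ~ X, family = binomial)
# two-column (successes/failures) response with the probit link
size <- 5
succ <- rbinom(n, size, binomial("probit")$linkinv(X[, 1]))
ymat <- cbind(succ, size - succ)
fit2 <- dglars(ymat ~ X, family = binomial("probit"))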
The R code for the binomial, Gamma and inverse Gaussian families is due to Hassan Pazira, while the Fortran version is due to Luigi Augugliaro.
dglars.fit is a workhorse function: it is more efficient when the design matrix does not require manipulations. For this reason, we suggest using this function when the dgLARS method is applied in a high-dimensional setting, i.e., when p > n.
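A minimal sketch (simulated data) of a direct call to the workhorse function in a p > n setting:

set.seed(321)
n <- 50
p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(1 + 2 * X[, 1]))
# bypass the formula interface and work on the numeric design matrix
fit <- dglars.fit(X, y, family = binomial)
fit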
When the gaussian, Gamma or inverse.gaussian family is used to model the error distribution, dglars returns the vector of estimates of the dispersion parameter \phi; by default, the generalized Pearson statistic is used as estimator, but the user can use the function phihat to specify other estimators (see phihat for more details).
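A sketch of how the dispersion estimates might be inspected; the commented phihat() call assumes the function accepts the fitted object together with a ‘type’ argument naming the estimator, which should be checked against the phihat help page:

set.seed(321)
n <- 100; p <- 10
X <- matrix(abs(rnorm(n * p)), n, p)
y <- rgamma(n, shape = 0.5, scale = 2 * exp(1 + X[, 1]))
fit <- dglars(y ~ X, family = Gamma("log"))
fit$phi                            # generalized Pearson estimates (default)
# phihat(fit, type = "deviance")   # assumed alternative estimator; see ?phihat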
The dgLARS solution curve can be estimated using two different algorithms: the predictor-corrector method and the cyclic coordinate descent method (see below for more details about the argument algorithm). The first algorithm is based on two steps. In the first step, called the predictor step, an approximation of the point that lies on the solution curve is computed. If the control parameter dg_max is equal to zero, an approximation of the optimal step size is also computed in this step, using a generalization of the method proposed in Efron et al. (2004). The optimal step size is defined as the reduction of the tuning parameter, denoted by d\gamma, such that at \gamma - d\gamma there is a change in the active set. In the second step, called the corrector step, a Newton-Raphson algorithm is used to correct the approximation previously computed. The main drawback of this algorithm is that the number of arithmetic operations required to compute the approximation scales as the cube of the number of variables, which makes it cumbersome in a high-dimensional setting. To overcome this problem, the second algorithm computes the dgLARS solution curve using an adaptive version of the cyclic coordinate descent method proposed in Friedman et al. (2010).
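A sketch of how the two algorithms can be selected via the ‘control’ argument (simulated Poisson data, for illustration only):

set.seed(321)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rpois(n, poisson()$linkinv(1 + X[, 1]))
# predictor-corrector method (default)
fit_pc  <- dglars(y ~ X, family = poisson, control = list(algorithm = "pc"))
# cyclic coordinate descent method
fit_ccd <- dglars(y ~ X, family = poisson, control = list(algorithm = "ccd"))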
The argument control is a list that can supply any of the following components:

algorithm: a string specifying the algorithm used to compute the solution curve. The predictor-corrector algorithm is used when algorithm = "pc" (default), while the cyclic coordinate descent method is used when algorithm = "ccd";

method: a string used to specify the kind of solution curve. If method = "dgLASSO" (default), the algorithm computes the solution curve defined by the differential geometric generalization of the LASSO estimator; otherwise (method = "dgLARS") the differential geometric generalization of the least angle regression method is used;

nv: control parameter for the pc algorithm. An integer value between 1 and min(n, p) used to specify the maximum number of variables in the final model. Default is nv = min(n - 1, p);

np: control parameter for the pc/ccd algorithm. A non-negative integer used to define the maximum number of solution points. For the predictor-corrector algorithm np is set to 50 * min(n - 1, p) (default); for the cyclic coordinate descent method, if g is not specified, this argument is set equal to 100 (default);

g0: control parameter for the pc/ccd algorithm. This parameter is used to set the smallest value of the tuning parameter \gamma. Default is g0 = ifelse(p < n, 1.0e-04, 0.05); this argument is not required when g is used with the cyclic coordinate descent algorithm;

dg_max: control parameter for the pc algorithm. A non-negative value used to specify the largest value of the step size. Setting dg_max = 0 (default), the predictor-corrector algorithm computes an approximation of the optimal step size (see Augugliaro et al. (2013) for more details);

nNR: control parameter for the pc algorithm. A non-negative integer used to specify the maximum number of iterations of the Newton-Raphson algorithm. Default is nNR = 50;

NReps: control parameter for the pc algorithm. A non-negative value used to define the convergence of the Newton-Raphson algorithm. Default is NReps = 1.0e-06;

ncrct: control parameter for the pc algorithm. When the Newton-Raphson algorithm does not converge, the step size (d\gamma) is reduced by d\gamma = cf * d\gamma and the corrector step is repeated. ncrct is a non-negative integer used to specify the maximum number of trials for the corrector step. Default is ncrct = 50;

cf: control parameter for the pc algorithm. The contraction factor is a real value in the interval [0, 1] used to reduce the step size as previously described. Default is cf = 0.5;

nccd: control parameter for the ccd algorithm. A non-negative integer used to specify the maximum number of steps of the cyclic coordinate descent algorithm. Default is 1.0e+05;

eps: control parameter for the pc/ccd algorithm. The meaning of this parameter depends on the algorithm used to estimate the solution curve:

i. if algorithm = "pc", then eps is used
   a. to identify a variable that will be included in the active set (the absolute value of the corresponding Rao's score test statistic belongs to [\gamma - eps, \gamma + eps]);
   b. to establish whether the corrector step must be repeated;
   c. to define the convergence of the algorithm, i.e., the actual value of the tuning parameter belongs to the interval [g0 - eps, g0 + eps];

ii. if algorithm = "ccd", then eps is used to define the convergence of a single solution point, i.e., each inner coordinate-descent loop continues until the maximum change in the Rao's score test statistic, after any coefficient update, is less than eps.

Default is eps = 1.0e-05.
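As an illustration (the values below are arbitrary, not recommendations), several control parameters can be supplied at once:

set.seed(321)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(X[, 1]))
ctrl <- list(algorithm = "pc", method = "dgLARS", nv = 5,
             g0 = 0.05, eps = 1.0e-06)
fit <- dglars(y ~ X, family = binomial, control = ctrl)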
Value
dglars returns an object with S3 class “dglars”, i.e., a list containing the following components:

call: the call that produced this object;

formula: if the model is fitted by dglars, the formula used;

family: a description of the error distribution used in the model;

unpenalized: the vector used to specify the unpenalized estimators;

np: the number of points of the dgLARS solution curve;

beta: the matrix of estimated coefficients, one column for each point of the solution curve;

phi: the vector of estimates of the dispersion parameter, one for each point of the solution curve;

ru: the matrix of the Rao's score test statistics of the variables included in the final model. This component is reported only if the predictor-corrector algorithm is used to fit the model;

dev: the vector of deviances, one for each point of the solution curve;

nnonzero: the sequence of numbers of nonzero coefficients for each value of the tuning parameter;

g: the sequence of \gamma values used to compute the solution curve;

X: the used design matrix;

y: the used response vector;

w: the vector of weights used to compute the adaptive dgLARS method;

action: a vector of characters describing how the active set changes along the solution curve;

conv: an integer value used to encode the warnings and errors related to the algorithm used to fit the model;

control: the list of control parameters used to compute the dgLARS solution curve.
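A sketch of how some of the components listed above might be inspected on a fitted object (simulated data for illustration):

set.seed(321)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(X[, 1]))
fit <- dglars(y ~ X, family = binomial)
fit$np          # number of points of the solution curve
head(fit$g)     # first values of the tuning parameter
fit$nnonzero    # number of nonzero coefficients at each point
fit$conv        # convergence/warning code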
Author(s)
Luigi Augugliaro and Hassan Pazira
Maintainer: Luigi Augugliaro luigi.augugliaro@unipa.it
References
Augugliaro L., Mineo A.M. and Wit E.C. (2016) <doi:10.1093/biomet/asw023> A differential-geometric approach to generalized linear models with grouped predictors, Biometrika, Vol 103(3), 563-577.
Augugliaro L., Mineo A.M. and Wit E.C. (2014) <doi:10.18637/jss.v059.i08> dglars: An R Package to Estimate Sparse Generalized Linear Models, Journal of Statistical Software, Vol 59(8), 1-40. https://www.jstatsoft.org/v59/i08/.
Augugliaro L., Mineo A.M. and Wit E.C. (2013) <doi:10.1111/rssb.12000> dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society. Series B., Vol 75(3), 471-498.
Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004) <doi:10.1214/009053604000000067> Least Angle Regression, The Annals of Statistics, Vol. 32(2), 407-499.
Friedman J., Hastie T. and Tibshirani R. (2010) <doi:10.18637/jss.v033.i01> Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, Vol. 33(1), 1-22.
Pazira H., Augugliaro L. and Wit E.C. (2018) <doi:10.1007/s11222-017-9761-7> Extended differential geometric LARS for high-dimensional GLMs with general dispersion parameter, Statistics and Computing, Vol. 28(4), 753-774.
See Also
coef.dglars, phihat, plot.dglars, print.dglars and summary.dglars methods.
Examples
set.seed(123)
#############################
# y ~ Binomial
n <- 100
p <- 100
X <- matrix(rnorm(n * p), n, p)
eta <- 1 + 2 * X[,1]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
fit <- dglars(y ~ X, family = binomial)
fit
# adaptive dglars method
b_wght <- coef(fit)$beta[, 20]
fit <- dglars(y ~ X, family = binomial, b_wght = b_wght)
fit
# the first three coefficients are not penalized
fit <- dglars(y ~ X, family = binomial, unpenalized = 1:3)
fit
# 'probit' link function
fit <- dglars(y ~ X, family = binomial("probit"))
fit
############################
# y ~ Poisson
n <- 100
p <- 100
X <- matrix(rnorm(n * p), n, p)
eta <- 2 + 2 * X[,1]
mu <- poisson()$linkinv(eta)
y <- rpois(n, mu)
fit <- dglars(y ~ X, family = poisson)
fit
############################
# y ~ Gamma
n <- 100
p <- 100
X <- matrix(abs(rnorm(n * p)), n, p)
eta <- 1 + 2 * X[,1]
mu <- drop(Gamma()$linkinv(eta))
shape <- 0.5
phi <- 1 / shape   # dispersion parameter
y <- rgamma(n, scale = mu / shape, shape = shape)
fit <- dglars(y ~ X, Gamma("log"))
fit