R: Fit multivariate GLM sparse regression

MGLMsparsereg {MGLM}

R Documentation

Fit multivariate GLM sparse regression

Description

Fit sparse regression in multivariate generalized linear models.

Usage

MGLMsparsereg(
  formula,
  data,
  dist,
  lambda,
  penalty,
  weight,
  init,
  penidx,
  maxiters = 150,
  ridgedelta,
  epsilon = 1e-05,
  regBeta = FALSE,
  overdisp
)

MGLMsparsereg.fit(
  Y,
  X,
  dist,
  lambda,
  penalty,
  weight,
  init,
  penidx,
  maxiters = 150,
  ridgedelta,
  epsilon = 1e-05,
  regBeta = FALSE,
  overdisp
)

Arguments

`formula`	an object of class `formula` (or one that can be coerced to that class): a symbolic description of the model to be fitted. The response must be on the left-hand side of `~`.
`data`	an optional data frame, list or environment (or object coercible by `as.data.frame` to a data frame) containing the variables in the model. If not found in `data` when using function `MGLMsparsereg`, the variables are taken from `environment(formula)`, typically the environment from which `MGLMsparsereg` is called.
`dist`	a description of the error distribution to fit. See `dist` for details.
`lambda`	penalty parameter: the tuning parameter \lambda that scales the regularization term J(B). See Details.
`penalty`	penalty type for the regularization term. Can be chosen from `"sweep"`, `"group"`, or `"nuclear"`. See Details for the description of each penalty type.
`weight`	an optional vector of weights assigned to each row of the data. Should be `NULL` or a numeric vector. It can be a variable from `data`, or a variable from `environment(formula)` whose length equals the number of rows of the data. If `weight=NULL`, every observation receives weight 1.
`init`	an optional matrix of initial values for the parameter estimates. Its dimensions must be compatible with the data. See `dist` for details of the dimensions in each distribution.
`penidx`	a logical vector indicating the variables to be penalized. The default value is `rep(TRUE, p)`, which means all predictors are subject to regularization. If `X` contains an intercept as its first column, use `penidx=c(FALSE,rep(TRUE,p-1))`.
`maxiters`	an optional numeric controlling the maximum number of iterations. The default value is `maxiters=150`.
`ridgedelta`	an optional numeric controlling the behavior of Nesterov's accelerated proximal gradient method. The default value is `1/(pd)`, where p × d is the dimension of the parameter matrix B.
`epsilon`	an optional numeric controlling the stopping criterion. The algorithm terminates when the relative change in the objective values of two successive iterates is less than `epsilon`. The default value is `epsilon=1e-5`.
`regBeta`	an optional logical variable used when running negative multinomial regression (`dist="NegMN"`). `regBeta` controls whether to run regression on the over-dispersion parameter. The default is `regBeta=FALSE`.
`overdisp`	an optional numeric value used only when fitting the sparse negative multinomial model (`dist="NegMN"`) with `regBeta=FALSE`. `overdisp` gives the over-dispersion value shared by all observations. By default it is estimated using negative multinomial regression. When `dist` is `"MN"`, `"DM"`, or `"GDM"`, or when `regBeta=TRUE`, the value of `overdisp` is ignored.
`Y`	a matrix containing the multivariate categorical response data. Rows of the matrix represent observations, while columns are the different categories. Rows and columns of all zeros are automatically removed when `dist="MN"`, `"DM"`, or `"GDM"`.
`X`	design matrix (including intercept). Number of rows of the matrix should match that of `Y`.
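
As a minimal sketch of the `penidx` convention described above, assuming a design matrix built with base R's `model.matrix` (the data frame `df` and its columns `x1` and `x2` are made up for illustration), the intercept column can be excluded from the penalty like this:

```r
# Hypothetical predictors; not part of the MGLM package or its data.
df <- data.frame(x1 = c(0.5, -1.2, 0.3, 2.0),
                 x2 = c(1.0, 0.0, -0.7, 0.4))

# model.matrix() puts the intercept in the first column, named "(Intercept)".
X <- model.matrix(~ x1 + x2, data = df)
p <- ncol(X)

# Leave the intercept unpenalized and penalize the remaining p - 1 columns.
penidx <- c(FALSE, rep(TRUE, p - 1))

# Equivalent construction that does not rely on column order.
penidx_by_name <- colnames(X) != "(Intercept)"
```

The resulting vector would then be supplied as `penidx = penidx` alongside `X` in a call to `MGLMsparsereg.fit`.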

Details

In general, we consider the regularization problem

\min_B h(B) = -l(B)+ J(B),

where l(B) is the loglikelihood function and J(B) is the regularization function.

Sparsity in the individual elements of the parameter matrix B is achieved by the lasso penalty (penalty="sweep")

J(B) = \lambda \sum_{k\in penidx} \sum_{j=1}^d |B_{kj}|.

Sparsity in the rows of the regression parameter matrix B is achieved by the group penalty (penalty="group")

J(B) = \lambda \sum_{k \in penidx} \|B_{k \cdot}\|_2,

where \|v\|_2 is the l_2 norm of a vector v. In other words, \|B_{k\cdot}\|_2 is the l_2 norm of the k-th row of the parameter matrix B.

Sparsity in the rank of the parameter matrix B is achieved by the nuclear norm penalty (penalty="nuclear")

J(B) = \lambda \|B\|_* = \lambda \sum_{i=1}^{\min(p, d)} \sigma_i(B),

where \sigma_i(B) are the singular values of the parameter matrix B. The nuclear norm \|B\|_* is a convex relaxation of rank(B)=\|\sigma(B)\|_0.
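
To make the three penalties above concrete, here is a small base R sketch that evaluates each J(B) for a toy 3 x 2 matrix. The helper names `lasso_penalty`, `group_penalty`, and `nuclear_penalty` are hypothetical and are not functions exported by MGLM; they only mirror the formulas as written.

```r
# Hypothetical helpers illustrating the three penalty formulas; they are
# not functions exported by MGLM.
lasso_penalty <- function(B, lambda, penidx = rep(TRUE, nrow(B))) {
  # Sum of absolute values over the penalized rows.
  lambda * sum(abs(B[penidx, , drop = FALSE]))
}

group_penalty <- function(B, lambda, penidx = rep(TRUE, nrow(B))) {
  # Sum of the Euclidean norms of the penalized rows.
  Bpen <- B[penidx, , drop = FALSE]
  lambda * sum(sqrt(rowSums(Bpen^2)))
}

nuclear_penalty <- function(B, lambda) {
  # Sum of the singular values of B.
  lambda * sum(svd(B)$d)
}

B <- rbind(c(3, 4),
           c(0, 0),
           c(1, 0))

lasso_val   <- lasso_penalty(B, lambda = 1)    # 3 + 4 + 0 + 0 + 1 + 0 = 8
group_val   <- group_penalty(B, lambda = 1)    # 5 + 0 + 1 = 6
nuclear_val <- nuclear_penalty(B, lambda = 1)  # sqrt(34), about 5.83
```

The second row of `B` is entirely zero, so it contributes nothing to the group penalty; this is the row-wise sparsity pattern that penalty is designed to encourage.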

See dist for details about distributions.

Value

Returns an object of class "MGLMsparsereg". An object of class "MGLMsparsereg" is a list containing at least the following components:

`coefficients`	the estimated matrix of regression coefficients.
`logL`	the final loglikelihood value.
`AIC`	Akaike information criterion.
`BIC`	Bayesian information criterion.
`Dof`	degrees of freedom of the estimated parameter.
`iter`	number of iterations used.
`maxlambda`	the maximum tuning parameter such that the estimated coefficients are not all zero. It is returned only when the `lambda` supplied to the function is large enough that all parameter estimates are zero; otherwise `maxlambda` is not computed.
`call`	a matched call.
`data`	the data used to fit the model: a list of the predictor matrix and the response matrix.
`penalty`	the penalty chosen when running the penalized regression.

Author(s)

Yiwen Zhang and Hua Zhou

Examples

## Generate Dirichlet Multinomial data
dist <- "DM"
n <- 100
p <- 15
d <- 5
## Set the seed before any random draws so the batch sizes m are reproducible too
set.seed(134)
m <- runif(n, min=0, max=25) + 25
X <- matrix(rnorm(n*p),n, p)
alpha <- matrix(0, p, d)
alpha[c(1,3, 5), ] <- 1
Alpha <- exp(X%*%alpha)
Y <- rdirmn(size=m, alpha=Alpha)

## Tuning
ngridpt <- 10
p <- ncol(X)
d <- ncol(Y)
pen <- 'nuclear'
spfit <- MGLMsparsereg(formula=Y~0+X, dist=dist, lambda=Inf, penalty=pen)
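
Because `lambda=Inf` shrinks every estimate to zero, the fit above should carry a `maxlambda` component (see Value). One possible next step, sketched here under stated assumptions, is to build a log-spaced grid of `ngridpt` candidate values below that threshold. The numeric `maxlambda` below is a stand-in so the snippet runs on its own, and the lower bound `maxlambda / 100` is an arbitrary illustrative choice, not something the package prescribes.

```r
ngridpt <- 10

# Stand-in value so the snippet runs without a fitted model; in practice
# one would use maxlambda <- spfit$maxlambda from the lambda=Inf fit.
maxlambda <- 10

# Log-spaced grid from maxlambda / 100 up to maxlambda (the lower bound is
# an arbitrary illustrative choice).
lambdas <- exp(seq(from = log(maxlambda / 100), to = log(maxlambda),
                   length.out = ngridpt))
```

Each value in `lambdas` could then be passed as `lambda` to `MGLMsparsereg`, and the returned `BIC` values compared across fits.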

[Package MGLM version 0.2.1 Index]