ordinalNet {ordinalNet} | R Documentation |
Ordinal regression models with elastic net penalty
Description
Fits ordinal regression models with elastic net penalty by coordinate descent. Supported model families include cumulative probability, stopping ratio, continuation ratio, and adjacent category. These families are a subset of vector glm's which belong to a model class we call the elementwise link multinomial-ordinal (ELMO) class. Each family in this class links a vector of covariates to a vector of class probabilities. Each of these families has a parallel form, which is appropriate for ordinal response data, as well as a nonparallel form that is appropriate for an unordered categorical response, or as a more flexible model for ordinal data. The parallel model has a single set of coefficients, whereas the nonparallel model has a set of coefficients for each response category except the baseline category. It is also possible to fit a model with both parallel and nonparallel terms, which we call the semi-parallel model. The semi-parallel model has the flexibility of the nonparallel model, but the elastic net penalty shrinks it toward the parallel model.
Usage
ordinalNet(
x,
y,
alpha = 1,
standardize = TRUE,
penaltyFactors = NULL,
positiveID = NULL,
family = c("cumulative", "sratio", "cratio", "acat"),
reverse = FALSE,
link = c("logit", "probit", "cloglog", "cauchit"),
customLink = NULL,
parallelTerms = TRUE,
nonparallelTerms = FALSE,
parallelPenaltyFactor = 1,
lambdaVals = NULL,
nLambda = 20,
lambdaMinRatio = 0.01,
includeLambda0 = FALSE,
alphaMin = 0.01,
pMin = 1e-08,
stopThresh = 1e-08,
threshOut = 1e-08,
threshIn = 1e-08,
maxiterOut = 100,
maxiterIn = 100,
printIter = FALSE,
printBeta = FALSE,
warn = TRUE,
keepTrainingData = TRUE
)
Arguments
x |
Covariate matrix. It is recommended that categorical covariates are converted to a set of indicator variables with a variable for each category (i.e. no baseline category); otherwise the choice of baseline category will affect the model fit. |
y |
Response variable. Can be a factor, ordered factor, or a matrix where each row is a multinomial vector of counts. A weighted fit can be obtained using the matrix option, since the row sums are essentially observation weights. Non-integer matrix entries are allowed. |
alpha |
The elastic net mixing parameter, with |
standardize |
If |
penaltyFactors |
Optional nonnegative vector of penalty factors with
length equal to the number of columns in |
positiveID |
Logical vector indicating whether each coefficient should
be constrained to be non-negative. If |
family |
Specifies the type of model family. Options are "cumulative" for cumulative probability, "sratio" for stopping ratio, "cratio" for continuation ratio, and "acat" for adjacent category. |
reverse |
Logical. If |
link |
Specifies the link function. The options supported are logit,
probit, complementary log-log, and cauchit. Only used if |
customLink |
Optional list containing a vectorized link function |
parallelTerms |
Logical. If |
nonparallelTerms |
Logical. if |
parallelPenaltyFactor |
Nonnegative numeric value equal to one by
default. The penalty on all parallel terms is scaled by this factor (as well
as variable-specific |
lambdaVals |
An optional user-specified lambda sequence (vector). If |
nLambda |
Positive integer. The number of lambda values in the solution path.
Only used if |
lambdaMinRatio |
A factor greater than zero and less than one. Only used
if |
includeLambda0 |
Logical. If |
alphaMin |
|
pMin |
Value greater than zero, but much less than one. During the optimization routine, the Fisher information is calculated using fitted probabilities. For this calculation, fitted probabilities are capped below by this value to prevent numerical instability. |
stopThresh |
In the relative log-likelihood change between successive lambda values falls below this threshold, then the last model fit is used for all remaining lambda. |
threshOut |
Convergence threshold for the coordinate descent outer loop.
The optimization routine terminates when the relative change in the
penalized log-likelihood between successive iterations falls below this threshold.
It is recommended to set |
threshIn |
Convergence threshold for the coordinate descent inner loop. Each
iteration consists of a single loop through each coefficient. The inner
loop terminates when the relative change in the penalized approximate
log-likelihood between successive iterations falls below this threshold.
It is recommended to set |
maxiterOut |
Maximum number of outer loop iterations. |
maxiterIn |
Maximum number of inner loop iterations. |
printIter |
Logical. If |
printBeta |
Logical. If |
warn |
Logical. If |
keepTrainingData |
Logical. If |
Details
The ordinalNet
function fits regression models for a categorical response
variable with K+1
levels. Conditional on the covariate vector x_i
(the i^{th}
row of x
), each observation has a vector of K+1
class probabilities (p_{i1}, \ldots, p_{i(K+1)})
. These probabilities
sum to one, and can therefore be parametrized by p_i = (p_{i1}, \ldots, p_{iK})
.
The probabilities are mapped to a set of K
quantities
\delta_i = (\delta_{i1}, \ldots, \delta_{iK}) \in (0, 1)^K
, which depends on the choice
of model family
. The elementwise link
function maps
\delta_i
to a set of K
linear predictors. Together, the family
and link
specifiy a link function between p_i
and \eta_i
.
Model families:
Let Y
denote the random response variable for a single observation,
conditional on the covariates values of the observation. The random variable
Y
is discrete with support {1, \ldots, K+1
}. The following model
families are defined according to these mappings between the class
probabilities and the values \delta_1, \ldots, \delta_K
:
- Cumulative probability
\delta_j = P(Y \le j)
- Reverse cumulative probability
\delta_j = P(Y \ge j + 1)
- Stopping ratio
\delta_j = P(Y = j | Y \ge j)
- Reverse stopping ratio
\delta_j = P(Y=j + 1 | Y \le j + 1)
- Continuation ratio
\delta_j = P(Y > j | Y \ge j)
- Reverse continuation ratio
\delta_j = P(Y < j | Y \le j)
- Adjacent category
\delta_j = P(Y = j + 1 | j \le Y \le j+1)
- Reverse adjacent category
\delta_j = P(Y = j | j \le Y \le j+1)
Parallel, nonparallel, and semi-parallel model forms:
Models within each of these families can take one of three forms, which have
different definitions for the linear predictor \eta_i
. Suppose each
x_i
has length P
. Let b
be a length P
vector of
regression coefficients. Let B
be a P \times K
matrix of regression
coefficient. Let b_0
be a vector of K
intercept terms.
The three model forms are the following:
- Parallel
\eta_i = b_0 + b^T x_i
(parallelTerms=TRUE
,nonparallelTerms=FALSE
)- Nonparallel
\eta_i = b_0 + B^T x_i
(parallelTerms=FALSE
,nonparallelTerms=TRUE
)- Semi-parallel
\eta_i = b_0 + b^T x_i + B^T x_i
(parallelTerms=TRUE
,nonparallelTerms=TRUE
)
The parallel form has the defining property of ordinal models, which is that
a single linear combination b^T x_i
shifts the cumulative class probabilities
P(Y \le j)
in favor of either higher or lower categories. The linear predictors
are parallel because they only differ by their intercepts (b_0
). The nonparallel form
is a more flexible model, and it does not shift the cumulative probabilities together.
The semi-parallel model is equivalent to the nonparallel model, but the
elastic net penalty shrinks the semi-parallel coefficients toward a common
value (i.e. the parallel model), as well as shrinking all coefficients toward zero.
The nonparallel model, on the other hand, simply shrinks all coefficients toward zero.
When the response categories are ordinal, any of the three model forms could
be applied. When the response categories are unordered, only the nonparallel
model is appropriate.
Elastic net penalty:
The elastic net penalty is defined for each model form as follows. \lambda
and \alpha
are the usual elastic net tuning parameters, where \lambda
determines the degree to which coefficients are shrunk toward zero, and \alpha
specifies the amound of weight given to the L1 norm and squared L2 norm penalties.
Each covariate is allowed a unique penalty factor c_j
, which is specified with the
penaltyFactors
argument. By default c_j = 1
for all j
.
The semi-parallel model has a tuning parameter \rho
which determines the degree to
which the parallel coefficients are penalized. Small values of \rho
will
result in a fit closer to the parallel model, and large values of \rho
will result in a fit closer to the nonparallel model.
- Parallel
\lambda \sum_{j=1}^P c_j \{ \alpha |b_j| + \frac{1}{2} (1-\alpha) b_j^2 \}
- Nonparallel
\lambda \sum_{j=1}^P c_j \{ \sum_{k=1}^K \alpha |B_{jk}| + \frac{1}{2} (1-\alpha) B_{jk}^2 \}
- Semi-parallel
\lambda [ \rho \sum_{j=1}^P c_j \{ \alpha |b_j| + \frac{1}{2} (1-\alpha) b_j^2 \} + \sum_{j=1}^P c_j \{ \sum_{k=1}^K \alpha |B_{jk}| + \frac{1}{2} (1-\alpha) B_{jk}^2 \}]
ordinalNet
minimizes the following objective function. Let N
be
the number of observations, which is defined as the sum of the y
elements
when y
is a matrix.
objective = -1/N*loglik + penalty
Value
An object with S3 class "ordinalNet". Model fit information can be accessed
through the coef
, predict
, and summary
methods.
- coefs
Matrix of coefficient estimates, with each row corresponding to a lambda value. (If covariates were scaled with
standardize=TRUE
, the coefficients are returned on the original scale).- lambdaVals
Sequence of lambda values. If user passed a sequence to the
lambdaVals
, then it is this sequence. IflambdaVals
argument wasNULL
, then it is the sequence generated.- loglik
Log-likelihood of each model fit.
- nNonzero
Number of nonzero coefficients of each model fit, including intercepts.
- aic
AIC, defined as
-2*loglik + 2*nNonzero
.- bic
BIC, defined as
-2*loglik + log(N)*nNonzero
.- devPct
Percentage deviance explained, defined as
1 - loglik/loglik_0
, whereloglik_0
is the log-likelihood of the null model.- iterOut
Number of coordinate descent outer loop iterations until convergence for each lambda value.
- iterIn
Number of coordinate descent inner loop iterations on last outer loop for each lambda value.
- dif
Relative improvement in objective function on last outer loop for each lambda value. Can be used to diagnose convergence issues. If
iterOut
reachedmaxiterOut
anddif
is large, thenmaxiterOut
should be increased. Ifdif
is negative, this means the objective did not improve between successive iterations. This usually only occurs when the model is saturated and/or close to convergence, so a small negative value is not of concern. (When this happens, the algorithm is terminated for the current lambda value, and the coefficient estimates from the previous outer loop iteration are returned.)- nLev
Number of response categories.
- nVar
Number of covariates in
x
.- xNames
Covariate names.
- args
List of arguments passed to the
ordinalNet
function.
Examples
# Simulate x as independent standard normal
# Simulate y|x from a parallel cumulative logit (proportional odds) model
set.seed(1)
n <- 50
intercepts <- c(-1, 1)
beta <- c(1, 1, 0, 0, 0)
ncat <- length(intercepts) + 1 # number of response categories
p <- length(beta) # number of covariates
x <- matrix(rnorm(n*p), ncol=p) # n x p covariate matrix
eta <- c(x %*% beta) + matrix(intercepts, nrow=n, ncol=ncat-1, byrow=TRUE)
invlogit <- function(x) 1 / (1+exp(-x))
cumprob <- t(apply(eta, 1, invlogit))
prob <- cbind(cumprob, 1) - cbind(0, cumprob)
yint <- apply(prob, 1, function(p) sample(1:ncat, size=1, prob=p))
y <- as.factor(yint)
# Fit parallel cumulative logit model
fit1 <- ordinalNet(x, y, family="cumulative", link="logit",
parallelTerms=TRUE, nonparallelTerms=FALSE)
summary(fit1)
coef(fit1)
coef(fit1, matrix=TRUE)
predict(fit1, type="response")
predict(fit1, type="class")
# Fit nonparallel cumulative logit model
fit2 <- ordinalNet(x, y, family="cumulative", link="logit",
parallelTerms=FALSE, nonparallelTerms=TRUE)
fit2
coef(fit2)
coef(fit2, matrix=TRUE)
predict(fit2, type="response")
predict(fit2, type="class")
# Fit semi-parallel cumulative logit model (with both parallel and nonparallel terms)
fit3 <- ordinalNet(x, y, family="cumulative", link="logit",
parallelTerms=TRUE, nonparallelTerms=TRUE)
fit3
coef(fit3)
coef(fit3, matrix=TRUE)
predict(fit3, type="response")
predict(fit3, type="class")