penAFT.cv {penAFT}    R Documentation
Cross-validation function for fitting a regularized semiparametric accelerated failure time model
Description
A function to perform cross-validation and compute the solution path for the regularized semiparametric accelerated failure time model estimator.
Usage
penAFT.cv(X, logY, delta, nlambda = 50,
          lambda.ratio.min = 0.1, lambda = NULL,
          penalty = NULL, alpha = 1, weight.set = NULL,
          groups = NULL, tol.abs = 1e-8, tol.rel = 2.5e-4,
          standardize = TRUE, nfolds = 5, cv.index = NULL,
          admm.max.iter = 1e4, quiet = TRUE)
Arguments
X: An n × p matrix of predictors, with one row per subject.

logY: An n-dimensional vector of log-survival or log-censoring times.

delta: An n-dimensional binary vector: delta[i] = 1 if logY[i] is an observed log-survival time and delta[i] = 0 if it is a log-censoring time.

nlambda: The number of candidate tuning parameters to consider.

lambda.ratio.min: The ratio of the minimum to the maximum candidate tuning parameter value. As a default, we suggest 0.1, but standard model selection procedures should be applied to select lambda.

lambda: An optional (not recommended) prespecified vector of candidate tuning parameters. Should be in descending order.

penalty: Either "EN" or "SG" for the elastic net or sparse group lasso penalty.

alpha: The tuning parameter \alpha \in [0, 1] balancing the weighted \ell_1 penalty against the second component of the penalty; see Details. The default, alpha = 1, uses the weighted \ell_1 penalty alone.

weight.set: A list of weights. For both penalties, a p-dimensional vector of non-negative weights w can be supplied; for the sparse group lasso, a G-dimensional vector of non-negative group weights v can also be supplied. See Details.

groups: When using penalty "SG", a p-dimensional vector of integers indicating to which of the G groups each predictor belongs.

tol.abs: Absolute convergence tolerance.

tol.rel: Relative convergence tolerance.

standardize: Should predictors be standardized (i.e., scaled to have unit variance) for model fitting?

nfolds: The number of folds to be used for cross-validation. Default is five; ten is recommended when the sample size is especially small.

cv.index: An optional list of length nfolds, each element containing the indices of the subjects belonging to that fold. If not supplied, folds are assigned at random.

admm.max.iter: Maximum number of ADMM iterations.

quiet: Logical; if TRUE (the default), suppress progress messages.
Details
Given (\log y_1, x_1, \delta_1), \dots, (\log y_n, x_n, \delta_n), where for subject i (i = 1, \dots, n), y_i is the minimum of the survival time and censoring time, x_i is a p-dimensional predictor, and \delta_i is the censoring indicator (\delta_i = 1 if y_i is an observed survival time, \delta_i = 0 if censored), penAFT.cv performs nfolds cross-validation for selecting the tuning parameter to be used in the argument minimizing

\frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n \delta_i \{ \log y_i - \log y_j - (x_i - x_j)'\beta \}^{-} + \lambda g(\beta),

where \{a\}^{-} := \max(-a, 0), \lambda > 0, and g is either the weighted elastic net penalty (penalty = "EN") or the weighted sparse group lasso penalty (penalty = "SG").
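To make the loss concrete, the unpenalized Gehan loss above can be evaluated directly in base R. The following is a minimal sketch, not part of the penAFT package; gehan_loss and its arguments are illustrative names.

# Sketch (not package code): evaluate the Gehan loss
# (1/n^2) sum_i sum_j delta_i * max{ -(e_i - e_j), 0 },
# where e_i = log y_i - x_i' beta
gehan_loss <- function(beta, X, logY, delta) {
  e <- logY - as.numeric(X %*% beta)  # residuals e_i
  diffs <- outer(e, e, "-")           # [i, j] entry is e_i - e_j
  mean(delta * pmax(-diffs, 0))       # delta_i recycles over rows; mean divides by n^2
}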
The weighted elastic net penalty is defined as

\alpha \| w \circ \beta\|_1 + \frac{(1-\alpha)}{2}\|\beta\|_2^2,

where w is a set of non-negative weights (which can be specified in the weight.set argument). The weighted sparse group lasso penalty we consider is

\alpha \| w \circ \beta\|_1 + (1-\alpha)\sum_{l=1}^G v_l\|\beta_{\mathcal{G}_l}\|_2,

where again w is a set of non-negative weights and the v_l are weights applied to each of the G groups.
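Written out in R, the two penalty choices look as follows. This is an illustrative sketch (en_penalty and sg_penalty are not package functions), with w, v, and groups mirroring the weight.set and groups arguments.

# Sketch (not package code): the two penalty choices g(beta)
en_penalty <- function(beta, alpha, w = rep(1, length(beta))) {
  alpha * sum(w * abs(beta)) + (1 - alpha) / 2 * sum(beta^2)
}
sg_penalty <- function(beta, alpha, groups,
                       w = rep(1, length(beta)),
                       v = rep(1, length(unique(groups)))) {
  # group norms ||beta_{G_l}||_2, ordered by sorted group label;
  # v is assumed to be ordered the same way
  group_norms <- tapply(beta, groups, function(b) sqrt(sum(b^2)))
  alpha * sum(w * abs(beta)) + (1 - alpha) * sum(v * group_norms)
}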
Next, we define the cross-validation errors. Let \mathcal{V}_1, \dots, \mathcal{V}_K be a random nfolds = K element partition of [n] (the subjects), with the cardinality of each \mathcal{V}_k (the "kth fold") approximately equal for k = 1, \dots, K. Let \hat{\beta}_{\lambda(-\mathcal{V}_k)} be the solution with tuning parameter \lambda using only the data indexed by [n] \setminus \mathcal{V}_k (i.e., outside the kth fold). Then, defining e_i(\beta) := \log y_i - \beta'x_i for i = 1, \dots, n, we call

\frac{1}{|\mathcal{V}_k|^2} \sum_{i \in \mathcal{V}_k} \sum_{j \in \mathcal{V}_k} \delta_i \{e_i(\hat{\beta}_{\lambda(-\mathcal{V}_k)}) - e_{j}(\hat{\beta}_{\lambda(-\mathcal{V}_k)})\}^{-}

the cross-validated Gehan loss at \lambda in the kth fold, and refer to the sum of these quantities over all nfolds = K folds as the cross-validated Gehan loss.
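In terms of the gehan_loss sketch above, the kth-fold quantity is simply the Gehan loss evaluated on the held-out fold at the leave-fold-out estimate; bhat.k below is a hypothetical placeholder for \hat{\beta}_{\lambda(-\mathcal{V}_k)}.

# Sketch: kth-fold cross-validated Gehan loss, reusing gehan_loss() from above
# (bhat.k is a hypothetical leave-fold-k-out estimate at a fixed lambda)
fold.k <- cv.index[[k]]
gehan_loss(bhat.k, X[fold.k, , drop = FALSE], logY[fold.k], delta[fold.k])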
Similarly, letting

\tilde{e}_i(\hat{\beta}_\lambda) = \sum_{k = 1}^K (\log y_i - x_i'\hat{\beta}_{\lambda(-\mathcal{V}_k)}) \mathbf{1}(i \in \mathcal{V}_k)

for each i \in [n], we call

\sum_{i = 1}^n \sum_{j = 1}^n \delta_i \{\tilde{e}_i(\hat{\beta}_\lambda) - \tilde{e}_j(\hat{\beta}_\lambda)\}^{-}

the cross-validated linear predictor score at \lambda.
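Both criteria are returned by penAFT.cv, so a tuning parameter minimizing either can be recovered from the fitted object. The sketch below assumes fit is the result of a penAFT.cv call and that cv.err.obj holds one row per fold, as described in the Value section.

# Sketch: tuning parameter selection from a fit <- penAFT.cv(...) object
lam.linPred <- fit$full.fit$lambda[which.min(fit$cv.err.linPred)]      # linear predictor score
lam.gehan   <- fit$full.fit$lambda[which.min(colSums(fit$cv.err.obj))] # summed Gehan loss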
Value
full.fit: A model fit with the same output as a model fit using penAFT on the full data; see the penAFT documentation for details.

cv.err.linPred: An nlambda-dimensional vector of cross-validated linear predictor scores, one per candidate tuning parameter.

cv.err.obj: An nfolds × nlambda matrix of cross-validated Gehan losses, one row per fold and one column per candidate tuning parameter.

cv.index: A list of length nfolds; each element contains the indices of the subjects belonging to that fold.
Examples
# --------------------------------------
# Generate data
# --------------------------------------
library(penAFT)  # for penAFT.cv, penAFT.coef, penAFT.predict, and genSurvData
set.seed(1)
genData <- genSurvData(n = 50, p = 50, s = 10, mag = 2, cens.quant = 0.6)
X <- genData$X
logY <- genData$logY
delta <- genData$status
p <- dim(X)[2]
# -----------------------------------------------
# Fit elastic net penalized estimator
# -----------------------------------------------
fit.en <- penAFT.cv(X = X, logY = logY, delta = delta,
nlambda = 10, lambda.ratio.min = 0.1,
penalty = "EN", nfolds = 5,
alpha = 1)
# ---- coefficients at tuning parameter minimizing cross-validation error
coef.en <- penAFT.coef(fit.en)
# ---- predict at 8th tuning parameter from full fit
Xnew <- matrix(rnorm(10 * p), nrow = 10)
predict.en <- penAFT.predict(fit.en, Xnew = Xnew, lambda = fit.en$full.fit$lambda[8])
# -----------------------------------------------
# Fit sparse group penalized estimator
# -----------------------------------------------
groups <- rep(1:5, each = 10)
fit.sg <- penAFT.cv(X = X, logY = logY, delta = delta,
nlambda = 50, lambda.ratio.min = 0.01,
penalty = "SG", groups = groups, nfolds = 5,
alpha = 0.5)
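# To use predictor- or group-specific weights, a weight.set list can be
# supplied alongside groups; the element names w and v below follow the
# notation of the Details section and are an assumption to check against
# the package documentation, not verified API.
weight.set <- list(w = rep(1, p),  # l1 weights, one per predictor
                   v = rep(1, 5))  # group weights, one per group
fit.sg.w <- penAFT.cv(X = X, logY = logY, delta = delta,
                      nlambda = 50, lambda.ratio.min = 0.01,
                      penalty = "SG", groups = groups,
                      weight.set = weight.set, alpha = 0.5)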
# -----------------------------------------------
# Pass fold indices
# -----------------------------------------------
groups <- rep(1:5, each = 10)
cv.index <- list()
for (k in 1:5) {
  cv.index[[k]] <- which(rep(1:5, length = 50) == k)
}
fit.sg.cvIndex <- penAFT.cv(X = X, logY = logY, delta = delta,
nlambda = 50, lambda.ratio.min = 0.01,
penalty = "SG", groups = groups,
cv.index = cv.index,
alpha = 0.5)
# --- compare cv indices
## Not run: identical(fit.sg.cvIndex$cv.index, cv.index)