sox_cv {sox}    R Documentation

cross-validation for sox

Description

Conduct cross-validation (cv) for sox.

Usage

sox_cv(
  x,
  ID,
  time,
  time2,
  event,
  penalty,
  lambda,
  group,
  group_variable,
  own_variable,
  no_own_variable,
  penalty_weights,
  par_init,
  nfolds = 10,
  foldid = NULL,
  stepsize_init = 1,
  stepsize_shrink = 0.8,
  tol = 1e-05,
  maxit = 1000L,
  verbose = FALSE
)

Arguments

x

Predictor matrix with dimension nm * p, where n is the number of subjects, m is the maximum observation time, and p is the number of predictors. See Details.

ID

The ID of each subject. Each subject has exactly one ID, and multiple rows in x can share one ID.

time

Represents the start of each time interval.

time2

Represents the stop of each time interval.

event

Indicator of event. event = 1 when event occurs and event = 0 otherwise.

penalty

Character string, indicating whether "overlapping" or "nested" group lasso penalty is imposed.

lambda

Sequence of values of the regularization coefficient \lambda.

group

A G * G integer matrix required to describe the structure of the overlapping and nested groups. We recommend that the users generate it automatically using overlap_structure() and nested_structure(). See Examples and Details.

group_variable

A p * G integer matrix required to describe the structure of the overlapping groups. We recommend that the users generate it automatically using overlap_structure(). See Examples and Details.

own_variable

A non-decreasing integer vector of length G required to describe the structure of the nested groups. We recommend that the users generate it automatically using nested_structure(). See Examples and Details.

no_own_variable

An integer vector of length G required to describe the structure of the nested groups. We recommend that the users generate it automatically using nested_structure(). See Examples and Details.

penalty_weights

Optional, vector of length G specifying the group-specific penalty weights. We recommend that the users generate it automatically using overlap_structure() or nested_structure(). If not specified, \mathbf{1}_G is used.

par_init

Optional, vector of initial values of the optimization algorithm. Default initial value is zero for all p variables.

nfolds

Optional, the number of cross-validation folds. Default is 10.

foldid

Optional, user-specified vector indicating the cross-validation fold in which each observation should be included. Values in this vector should range from 1 to nfolds. If left unspecified, sox will randomly assign observations to folds.

stepsize_init

Initial value of the stepsize of the optimization algorithm. Default is 1.

stepsize_shrink

Factor in (0,1) by which the stepsize shrinks in the backtracking linesearch. Default is 0.8.

tol

Convergence criterion. Algorithm stops when the l_2 norm of the difference between two consecutive updates is smaller than tol.

maxit

Maximum number of iterations allowed.

verbose

Logical, whether progress is printed.
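As an illustration of the foldid argument, the following base R sketch builds a balanced random fold assignment with values in 1 to nfolds. The toy ID vector and the names fold_by_subject and fold_by_row are hypothetical. This page does not state whether "observation" refers to a subject or to a row of x, so the sketch produces both forms; which one sox_cv() expects is an assumption to verify before use.

```r
set.seed(1)  # reproducible fold assignment

# Toy subject IDs, one entry per row of x (hypothetical data).
ID     <- c(1, 1, 1, 2, 2, 3, 4, 4, 5, 6)
nfolds <- 3

# One fold label per subject, balanced across folds, in random order.
subjects        <- unique(ID)
fold_by_subject <- sample(rep(seq_len(nfolds), length.out = length(subjects)))
names(fold_by_subject) <- subjects

# Expanded to one entry per row, so that all rows of a subject share a fold.
fold_by_row <- unname(fold_by_subject[as.character(ID)])
```

Assigning folds by subject keeps every time interval of a subject in the same fold, which is the usual choice for start-stop data.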

Details

For each lambda, 10-fold cross-validation (by default) is performed. The cv error is defined as follows. Suppose we perform K-fold cross-validation, and denote by \hat{\beta}^{-k} the estimate obtained from the remaining K-1 folds (training set). The error of the k-th fold (test set) is defined as 2(P-Q) divided by R, where P is the log partial likelihood evaluated at \hat{\beta}^{-k} using the entire dataset, Q is the log partial likelihood evaluated at \hat{\beta}^{-k} using the training set, and R is the number of events in the test set.

We do not use the negative log partial likelihood evaluated at \hat{\beta}^{-k} using the test set alone, because the definition above makes efficient use of the risk set and is therefore more stable when the number of events in each test set is small (think of leave-one-out). To account for imbalance in outcomes among the randomly formed test sets, we divide the deviance 2(P-Q) by R. The cv error is used in parameter tuning.

To get the estimated coefficients that have the minimum cv error, use sox_cv()$Estimates[, sox_cv$index["min",]]. To apply the 1-se rule, use sox_cv()$Estimates[, sox_cv$index["1se",]].
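The fold-wise error described above can be written compactly as follows. The symbols \ell (log partial likelihood on the entire dataset), \ell^{-k} (log partial likelihood on the training set of fold k), the subscripts on P, Q and R, and the name \mathrm{err}_k are notation introduced here for illustration only. The second line restates the description of cvm in the Value section.

```latex
\mathrm{err}_k(\lambda) = \frac{2\,(P_k - Q_k)}{R_k},
\qquad P_k = \ell\bigl(\hat{\beta}^{-k}\bigr),
\qquad Q_k = \ell^{-k}\bigl(\hat{\beta}^{-k}\bigr),

\mathrm{cvm}(\lambda) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{err}_k(\lambda).
```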

Value

A list.

lambdas

A vector of lambda used for each cross-validation.

cvm

The cv error averaged across all folds for each lambda.

cvsd

The standard error of the cv error for each lambda.

cvup

The cv error plus its standard error for each lambda.

cvlo

The cv error minus its standard error for each lambda.

nzero

The number of non-zero coefficients at each lambda.

sox.fit

A fitted model for the full data at all lambdas of class "sox".

lambda.min

The lambda at which cvm reaches its minimum.

lambda.1se

The largest lambda such that cvm is less than the cvup at lambda.min (that is, the minimum of cvm plus its standard error).

foldid

The fold assignments used.

index

A one-column matrix with the indices of lambda.min and lambda.1se (rows "min" and "1se").

iterations

A vector of the number of iterations needed to converge at each \lambda in lambdas.
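The relationship between cvm, cvsd, cvup, cvlo, lambda.min, lambda.1se and index can be sketched in base R on toy numbers. This is a minimal sketch and not the package's internal code: the matrix err of fold-wise errors is hypothetical, and computing the standard error as the standard deviation across folds divided by sqrt(K) is an assumption.

```r
# Hypothetical fold-wise cv errors: 4 folds (rows) by 5 lambdas (columns).
lambdas <- c(1, 0.5, 0.25, 0.125, 0.0625)
err <- rbind(
  c(3.1, 2.2, 2.0, 2.1, 2.4),
  c(2.9, 2.1, 1.7, 2.2, 2.2),
  c(3.0, 2.1, 2.3, 2.2, 2.3),
  c(3.0, 2.0, 2.0, 2.1, 2.3)
)
K <- nrow(err)

cvm  <- colMeans(err)                   # mean cv error per lambda
cvsd <- apply(err, 2, sd) / sqrt(K)     # standard error (assumed definition)
cvup <- cvm + cvsd
cvlo <- cvm - cvsd

# lambda.min: the lambda at which cvm is smallest.
i_min      <- which.min(cvm)
lambda_min <- lambdas[i_min]

# 1-se rule: the largest lambda whose cvm is below cvup at lambda.min.
lambda_1se <- max(lambdas[cvm < cvup[i_min]])
i_1se      <- which(lambdas == lambda_1se)

# One-column matrix of indices with rows "min" and "1se".
index <- matrix(c(i_min, i_1se), ncol = 1,
                dimnames = list(c("min", "1se"), NULL))
```

With these toy numbers the 1-se rule picks a larger lambda than lambda_min, that is, a more heavily penalized model whose cv error is still within one standard error of the minimum.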

See Also

sox, plot.sox_cv.

Examples

x <- as.matrix(sim[, c("A1","A2","C1","C2","B","A1B","A2B","C1B","C2B")])
lam.seq <- exp(seq(log(1e0), log(1e-3), length.out = 20))

# Variables:
## 1: A1
## 2: A2
## 3: C1
## 4: C2
## 5: B
## 6: A1B
## 7: A2B
## 8: C1B
## 9: C2B

# Overlapping groups:
## g1: A1, A2, A1B, A2B
## g2: B, A1B, A2B, C1B, C2B
## g3: A1B, A2B
## g4: C1, C2, C1B, C2B
## g5: C1B, C2B

overlapping.groups <- list(c(1, 2, 6, 7),
                           c(5, 6, 7, 8, 9),
                           c(6, 7),
                           c(3, 4, 8, 9),
                           c(8, 9))
                           
pars.overlapping <- overlap_structure(overlapping.groups)

cv.overlapping <- sox_cv(
  x = x,
  ID = sim$Id,
  time = sim$Start,
  time2 = sim$Stop,
  event = sim$Event,
  penalty = "overlapping",
  lambda = lam.seq,
  group = pars.overlapping$groups,
  group_variable = pars.overlapping$groups_var,
  penalty_weights = pars.overlapping$group_weights,
  nfolds = 5,
  tol = 1e-4,
  maxit = 1e3,
  verbose = FALSE
)

str(cv.overlapping)

# Nested groups (misspecified, for the demonstration of the software only.)
## g1: A1, A2, C1, C2, B, A1B, A2B, C1B, C2B
## g2: A1, A2, A1B, A2B
## g3: C1, C2, C1B, C2B
## g4: 1
## g5: 2
## ...
## g12: 9

nested.groups <- list(1:9,
                      c(1, 2, 6, 7),
                      c(3, 4, 8, 9),
                      1, 2, 3, 4, 5, 6, 7, 8, 9)

pars.nested <- nested_structure(nested.groups)

cv.nested <- sox_cv(
  x = x,
  ID = sim$Id,
  time = sim$Start,
  time2 = sim$Stop,
  event = sim$Event,
  penalty = "nested",
  lambda = lam.seq,
  group = pars.nested$groups,
  own_variable = pars.nested$own_variables,
  no_own_variable = pars.nested$N_own_variables,
  penalty_weights = pars.nested$group_weights,
  nfolds = 5,
  tol = 1e-4,
  maxit = 1e3,
  verbose = FALSE
)

str(cv.nested)
                

[Package sox version 1.1 Index]