R: Cross-Validation

gesso.cv {gesso}

R Documentation

Cross-Validation

Description

Performs nfolds-fold cross-validation to tune the hyperparameters lambda_1 and lambda_2 of the gesso model.

Usage

gesso.cv(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE, grid = NULL,
         grid_size = 20, grid_min_ratio = NULL, alpha = NULL, family = "gaussian", 
         type_measure = "loss", fold_ids = NULL, nfolds = 4, 
         parallel = TRUE, seed = 42, tolerance = 1e-3, max_iterations = 5000, 
         min_working_set_size = 100, verbose = TRUE)

Arguments

`G`	matrix of main effects of size `n x p`, variables organized by columns
`E`	vector of environmental measurements
`Y`	outcome vector. Set `family="gaussian"` for the continuous outcome and `family="binomial"` for the binary outcome with 0/1 levels
`C`	matrix of confounders of size `n x m`, variables organized by columns
`normalize`	`TRUE` to normalize matrix `G` and vector `E`
`normalize_response`	`TRUE` to normalize vector `Y` (for `family="gaussian"`)
`grid`	grid sequence for tuning hyperparameters; the same grid is used for both `lambda_1` and `lambda_2`
`grid_size`	specify `grid_size` to generate grid automatically. Grid is generated by calculating `max_lambda` from the data (smallest lambda such that all the coefficients are zero). `min_lambda` is calculated as a product of `max_lambda` and `grid_min_ratio`. The program then generates `grid_size` values equidistant on the log10 scale from `min_lambda` to `max_lambda`
`grid_min_ratio`	parameter to determine `min_lambda` (smallest value for the grid of lambdas), default is 0.1 for p > n, 0.01 otherwise
`alpha`	if `NULL` independent 2D grid is used for (`lambda_1`, `lambda_2`), else 1D grid is used where `lambda_2` = `alpha` * `lambda_1`, i.e. (`lambda_1`, `alpha` * `lambda_1`)
`family`	`"gaussian"` for continuous outcome and `"binomial"` for binary
`type_measure`	loss to use for cross-validation. Specify `type_measure="loss"` for negative log likelihood or `type_measure="auc"` for AUC (for `family="binomial"` only)
`fold_ids`	option to input custom fold assignments
`tolerance`	tolerance for the dual gap convergence criterion
`max_iterations`	maximum number of iterations
`min_working_set_size`	minimum size of the working set
`nfolds`	number of cross-validation splits
`parallel`	`TRUE` to enable parallel cross-validation
`seed`	random seed that controls the random fold assignments
`verbose`	`TRUE` to print messages
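
The grid construction described under `grid_size` and the pairing described under `alpha` can be sketched in base R as follows. This is a minimal illustration, not the package's internal code: `max_lambda` is set to an arbitrary placeholder here (gesso computes it from the data), and the `alpha` value is hypothetical.

```r
# Placeholder values; gesso derives max_lambda from the data.
max_lambda <- 2.5        # assumed for illustration only
grid_min_ratio <- 0.01
grid_size <- 20

min_lambda <- max_lambda * grid_min_ratio

# grid_size values equidistant on the log10 scale from min_lambda to max_lambda
grid <- 10^seq(log10(min_lambda), log10(max_lambda), length.out = grid_size)

# alpha = NULL: independent 2D grid over every (lambda_1, lambda_2) pair
grid_2d <- expand.grid(lambda_1 = grid, lambda_2 = grid)

# alpha given: 1D grid with lambda_2 tied to lambda_1
alpha <- 0.5             # hypothetical value
grid_1d <- data.frame(lambda_1 = grid, lambda_2 = alpha * grid)
```

With these settings the 2D grid has `grid_size^2` rows and the 1D grid has `grid_size` rows, which is why supplying `alpha` makes cross-validation considerably cheaper.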

Value

A list of objects

`cv_result`	a tibble with cross-validation results: the loss and the number of non-zero coefficients, averaged across folds, for each (`lambda_1`, `lambda_2`) pair on the path. It can be used for custom parameter tuning (e.g. selecting (`lambda_1`, `lambda_2`) with a certain number of non-zero main effects and/or a certain number of interactions). Columns:
    `mean_loss`	loss value averaged across folds, vector of size `lambda_1`*`lambda_2`
    `mean_beta_g_nonzero`	number of non-zero main effects averaged across folds, vector of size `lambda_1`*`lambda_2`
    `mean_beta_gxe_nonzero`	number of non-zero interactions averaged across folds, vector of size `lambda_1`*`lambda_2`
    `lambda_1`	`lambda_1` pass, decreasing
    `lambda_2`	`lambda_2` pass, oscillating
`lambda_min`	a tibble of optimal (`lambda_1`, `lambda_2`) values, tuning parameter values that give minimum cross-validation loss (`mean_loss`)
`fit`	list returned by the function `gesso.fit` on the full data
`grid`	vector of values used for hyperparameters tuning
`full_cv_result`	inner variables
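
As a sketch of the custom tuning mentioned under `cv_result`, the snippet below picks the lowest-loss pair among those with at most a given average number of non-zero interactions. The data frame is a made-up stand-in that only borrows the documented column names; its values are invented for illustration, and `max_interactions` is a hypothetical threshold.

```r
# Toy stand-in for tune_model$cv_result; values are invented for illustration.
cv_result <- data.frame(
  lambda_1 = c(1.0, 1.0, 0.5, 0.5),
  lambda_2 = c(1.0, 0.5, 1.0, 0.5),
  mean_loss = c(2.0, 1.6, 1.4, 1.1),
  mean_beta_g_nonzero = c(0, 2, 5, 9),
  mean_beta_gxe_nonzero = c(0, 1, 2, 6)
)

max_interactions <- 2  # hypothetical sparsity budget

# Keep pairs within the budget, then take the one with the smallest loss
candidates <- cv_result[cv_result$mean_beta_gxe_nonzero <= max_interactions, ]
selected <- candidates[which.min(candidates$mean_loss), c("lambda_1", "lambda_2")]
```

The result is a one-row table with the same two columns as `lambda_min`. Whether it can be passed to `gesso.coef` in place of `lambda_min` should be checked against the package before relying on it.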

Examples

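library(gesso)

# Simulate a dataset and tune (lambda_1, lambda_2) by 3-fold cross-validation
data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train,
                      grid_size=20, parallel=TRUE, nfolds=3)
# Extract the coefficients at the optimal (lambda_1, lambda_2) once, then split them
coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)
gxe_coefficients = coefficients$beta_gxe
g_coefficients = coefficients$beta_g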
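
To reuse the same fold assignment across runs, folds can be built by hand and passed through `fold_ids`. The sketch below assumes that `fold_ids` is an integer vector of length `n` with values in 1..`nfolds`; that format is an assumption, not something this page specifies. The sample size `n` is a placeholder.

```r
set.seed(1)
n <- 10        # placeholder sample size; use nrow(G) in practice
nfolds <- 3

# Balanced random assignment: repeat 1..nfolds to length n, then shuffle
fold_ids <- sample(rep(seq_len(nfolds), length.out = n))

# Fold sizes differ by at most one
fold_sizes <- table(fold_ids)

# Assumed usage, given the format above:
# tune_model <- gesso.cv(G, E, Y, fold_ids = fold_ids)
```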

[Package gesso version 1.0.2 Index]