stepwise_search {LEGIT}

R Documentation

Stepwise search for the best subset of genetic variants or environments with the LEGIT model

Description

[Fast, recommended for a small number of variables] Adds the best variable or drops the worst variable one at a time in the genetic score (if search="genes") or the environmental score (if search="env"). You can select the search criterion (AIC, BIC, cross-validation error, cross-validation AUC) used to determine which variable is the best/worst and should be added/dropped. Note that when the number of variables in G and E is large, this search does not generally converge to the optimal subset; this function is only recommended when you have a small number of variables (e.g. 2 environments, 6 genetic variants). If using cross-validation (search_criterion="cv" or search_criterion="cv_AUC"), cross-validating every candidate variable is extremely slow. To avoid this, we recommend setting a p-value threshold (forward_exclude_p_bigger, backward_exclude_p_smaller) and forcing the algorithm not to look at models with a bigger AIC (exclude_worse_AIC=TRUE).
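
The add-one-variable-at-a-time idea can be illustrated with a minimal base R sketch. It uses ordinary linear regression on simulated data rather than the LEGIT model, so it only shows the general shape of a forward search by AIC and is not the package's implementation. The data, the helper `forward_step`, and the variable names are all hypothetical.

```r
# Simulated data (hypothetical): y depends on x1 and x3, but not on x2.
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x3 + rnorm(n)

# One forward step: fit the current model plus each candidate in turn
# and return the candidate with the lowest AIC, or NULL if none improves.
forward_step <- function(data, current, candidates, response = "y") {
  current_terms <- if (length(current) > 0) current else "1"
  base_aic <- AIC(lm(reformulate(current_terms, response), data = data))
  aic <- vapply(candidates, function(v) {
    AIC(lm(reformulate(c(current, v), response), data = data))
  }, numeric(1))
  best <- names(which.min(aic))
  if (aic[[best]] >= base_aic) NULL else best
}

# Repeat forward steps until no candidate lowers the AIC.
selected <- character(0)
remaining <- c("x1", "x2", "x3")
repeat {
  best <- forward_step(d, selected, remaining)
  if (is.null(best)) break
  selected <- c(selected, best)
  remaining <- setdiff(remaining, best)
  if (length(remaining) == 0) break
}
print(selected)
```

Because each step only looks one variable ahead, a search like this can miss the best subset when there are many correlated candidates, which is why the function is recommended for a small number of variables.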

Usage

stepwise_search(
  data,
  formula,
  interactive_mode = FALSE,
  genes_original = NULL,
  env_original = NULL,
  genes_extra = NULL,
  env_extra = NULL,
  search_type = "bidirectional-forward",
  search = "both",
  search_criterion = "AIC",
  forward_exclude_p_bigger = 0.2,
  backward_exclude_p_smaller = 0.01,
  exclude_worse_AIC = TRUE,
  max_steps = 100,
  cv_iter = 5,
  cv_folds = 10,
  folds = NULL,
  Huber_p = 1.345,
  classification = FALSE,
  start_genes = NULL,
  start_env = NULL,
  eps = 0.01,
  maxiter = 100,
  family = gaussian,
  ylim = NULL,
  seed = NULL,
  print = TRUE,
  remove_miss = FALSE,
  test_only = FALSE
)

Arguments

`data`	data.frame of the dataset to be used.
`formula`	Model formula. Use E for the environmental score and G for the genetic score. Do not manually code interactions; write them in the formula instead (ex: `G*E*z` or `G:E:z`).
`interactive_mode`	If TRUE, uses interactive mode. In interactive mode, at each iteration, the user is shown the AIC, BIC and p-value for the best 5 variables, along with the cross-validation `R^2` if `search_criterion="cv"` or the cross-validation AUC if `search_criterion="cv_AUC"`. The user must then enter a number between 1 and 5 to select the variable to be added; entering anything else stops the search.
`genes_original`	data.frame of the variables inside the genetic score G (can be any sort of variable, doesn't even have to be genetic).
`env_original`	data.frame of the variables inside the environmental score E (can be any sort of variable, doesn't even have to be environmental).
`genes_extra`	data.frame of the additional variables to try including inside the genetic score G (can be any sort of variable, doesn't even have to be genetic). Set to NULL if using a backward search.
`env_extra`	data.frame of the variables to try including inside the environmental score E (can be any sort of variable, doesn't even have to be environmental). Set to NULL if using a backward search.
`search_type`	If `search_type="forward"`, uses a forward search. If `search_type="backward"`, uses backward search. If `search_type="bidirectional-forward"`, uses bidirectional search (that starts as a forward search). If `search_type="bidirectional-backward"`, uses bidirectional search (that starts as a backward search).
`search`	If `search="genes"`, uses a stepwise search for the genetic score variables. If `search="env"`, uses a stepwise search for the environmental score variables. If `search="both"`, uses a stepwise search for both the gene and environmental score variables (Default = "both").
`search_criterion`	Criterion used to determine which variable is the best to add or the worst to drop (Default = "AIC"). `search_criterion="AIC"` uses the AIC; `search_criterion="AICc"` uses the AICc; `search_criterion="BIC"` uses the BIC; `search_criterion="cv"` uses the cross-validation error; `search_criterion="cv_AUC"` uses the cross-validated AUC; `search_criterion="cv_Huber"` uses the Huber cross-validation error; `search_criterion="cv_L1"` uses the L1-norm cross-validation error. The Huber and L1-norm cross-validation errors are alternatives to the usual L2-norm cross-validation error (on which the `R^2` is based) that are more resistant to outliers; lower values are better.
`forward_exclude_p_bigger`	If p-value > `forward_exclude_p_bigger`, we do not consider the variable for inclusion in the forward steps (Default = .20). This is an exclusion option whose purpose is to skip variables that are likely not worth examining, which makes the algorithm faster, especially with cross-validation. Set to 1 to prevent any exclusion here.
`backward_exclude_p_smaller`	If p-value < `backward_exclude_p_smaller`, we do not consider the variable for removal in the backward steps (Default = .01). This is an exclusion option whose purpose is to skip variables that are likely not worth examining, which makes the algorithm faster, especially with cross-validation. Set to 0 to prevent any exclusion here.
`exclude_worse_AIC`	If AIC with variable > AIC without variable, we ignore the variable (Default = TRUE). This is an exclusion option whose purpose is to skip variables that are likely not worth examining, which makes the algorithm faster, especially with cross-validation. Set to FALSE to prevent any exclusion here.
`max_steps`	Maximum number of steps taken (Default = 100).
`cv_iter`	Number of cross-validation iterations (Default = 5).
`cv_folds`	Number of cross-validation folds (Default = 10). Using `cv_folds=NROW(data)` will lead to leave-one-out cross-validation.
`folds`	Optional list of vectors containing the fold number for each observation. Bypasses `cv_iter` and `cv_folds`. Setting your own folds could be important for certain data types like time series or longitudinal data.
`Huber_p`	Parameter controlling the Huber cross-validation error (Default = 1.345).
`classification`	Set to TRUE if you are doing classification (binary outcome).
`start_genes`	Optional starting points for genetic score (must be the same length as the number of columns of `genes`).
`start_env`	Optional starting points for environmental score (must be the same length as the number of columns of `env`).
`eps`	Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results).
`maxiter`	Maximum number of iterations.
`family`	Outcome distribution and link function (Default = gaussian).
`ylim`	Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (ex: [0,1] with binomial distribution).
`seed`	Seed for cross-validation folds.
`print`	If TRUE, prints all the steps and notes/warnings. Highly recommended unless you are batch running multiple stepwise searches (Default = TRUE).
`remove_miss`	If TRUE, remove missing data completely, otherwise missing data is only removed when adding or dropping a variable (Default = FALSE).
`test_only`	If TRUE, uses only the first fold for training and predicts the other folds, without training on them. Instead of cross-validation, this gives a train/test split, and the output is the test R-squared.
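
For the `folds` argument, the following base R sketch shows one way to build fold assignments for longitudinal data so that all rows from the same subject land in the same fold. It assumes that `folds` expects one vector of fold numbers per cross-validation iteration, each as long as the number of rows in `data`; that reading of "list of vectors" is an assumption, and the `subject_id` layout and the helper `make_subject_folds` are hypothetical.

```r
# Hypothetical longitudinal layout: 20 subjects with 3 visits each.
set.seed(1)
subject_id <- rep(1:20, each = 3)

# Draw one fold number (1..k) per subject, then copy it to every row
# belonging to that subject so a subject is never split across folds.
make_subject_folds <- function(id, k) {
  subjects <- unique(id)
  fold_of_subject <- sample(rep_len(seq_len(k), length(subjects)))
  fold_of_subject[match(id, subjects)]
}

# Assumption: one fold vector per cross-validation iteration (2 here).
folds <- replicate(2, make_subject_folds(subject_id, k = 5), simplify = FALSE)

# Check that each subject's rows share a single fold number.
rows_per_subject_fold <- tapply(folds[[1]], subject_id,
                                function(f) length(unique(f)))
stopifnot(all(rows_per_subject_fold == 1))
```

A list built this way could then be passed as `folds=folds`, in which case `cv_iter` and `cv_folds` are bypassed.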

Value

Returns an object of the class "LEGIT", which is a list containing, in the following order: a glm fit of the main model, a glm fit of the genetic score, a glm fit of the environmental score, and a list of the true model parameters (AIC, BIC, rank, df.residual, null.deviance), which the individual model parts (main, genetic, environmental) do not estimate properly.

Examples

## Not run: 
## Continuous example
train = example_3way(250, 2.5, seed=777)
# Forward search for genes based on BIC (in interactive mode)
forward_genes_BIC = stepwise_search(train$data, genes_extra=train$G, env_original=train$E,
  formula=y ~ E*G*z, search_type="forward", search="genes", search_criterion="BIC",
  interactive_mode=TRUE)
# Bidirectional-backward search for environments based on cross-validation error
bidir_backward_env_cv = stepwise_search(train$data, genes_original=train$G, env_original=train$E,
  formula=y ~ E*G*z, search_type="bidirectional-backward", search="env", search_criterion="cv")
## Binary example
train_bin = example_2way(500, 2.5, logit=TRUE, seed=777)
# Forward search for genes based on cross-validated AUC (in interactive mode)
forward_genes_AUC = stepwise_search(train_bin$data, genes_extra=train_bin$G,
  env_original=train_bin$E, formula=y ~ E*G, search_type="forward", search="genes",
  search_criterion="cv_AUC", classification=TRUE, family=binomial, interactive_mode=TRUE)
# Bidirectional-forward search for genes based on AIC
bidir_forward_genes_AIC = stepwise_search(train_bin$data, genes_extra=train_bin$G,
  env_original=train_bin$E, formula=y ~ E*G, search_type="bidirectional-forward", search="genes",
  search_criterion="AIC", classification=TRUE, family=binomial)

## End(Not run)

[Package LEGIT version 1.4.1 Index]