stepwise_search {LEGIT} | R Documentation |
Stepwise search for the best subset of genetic variants or environments with the LEGIT model
Description
[Fast, recommended for a small number of variables] Adds the best variable or drops the worst variable one at a time in the genetic score (if search="genes") or the environmental score (if search="env"). You can select the search criterion (AIC, BIC, cross-validation error, cross-validation AUC) used to determine which variable is the best/worst and should be added/dropped. Note that when the number of variables in G and E is large, this search does not generally converge to the optimal subset; the function is only recommended when you have a small number of variables (e.g. 2 environments, 6 genetic variants). If using cross-validation (search_criterion="cv" or search_criterion="cv_AUC"), cross-validating with every candidate variable is extremely slow. To avoid this, we recommend setting a p-value threshold (forward_exclude_p_bigger, backward_exclude_p_smaller) and forcing the algorithm not to look at models with a bigger AIC (exclude_worse_AIC=TRUE).
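The add-one-variable-at-a-time idea described above can be sketched with base R alone. This is a minimal illustration of generic forward selection by AIC on an ordinary glm, not the LEGIT implementation; the simulated data, the helper name forward_aic, and its arguments are assumptions made for this example only.

```r
# Generic forward selection by AIC using base R glm().
# Illustration only: this is NOT how LEGIT fits its G and E scores.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)  # x3 is pure noise

# Hypothetical helper: greedily add the candidate that lowers AIC the most
forward_aic <- function(data, outcome, candidates, max_steps = 100) {
  selected <- character(0)
  # Start from the intercept-only model
  current_aic <- AIC(glm(reformulate("1", response = outcome), data = data))
  for (step in seq_len(max_steps)) {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # AIC of the model obtained by adding each remaining candidate in turn
    aics <- vapply(remaining, function(v) {
      AIC(glm(reformulate(c(selected, v), response = outcome), data = data))
    }, numeric(1))
    # Stop as soon as no candidate improves the AIC
    if (min(aics) >= current_aic) break
    selected <- c(selected, remaining[which.min(aics)])
    current_aic <- min(aics)
  }
  list(selected = selected, AIC = current_aic)
}

result <- forward_aic(dat, "y", c("x1", "x2", "x3"))
print(result$selected)
```

Because each step only compares models that differ by one variable, the search is fast but greedy, which is why it may miss the optimal subset when there are many candidates.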
Usage
stepwise_search(
data,
formula,
interactive_mode = FALSE,
genes_original = NULL,
env_original = NULL,
genes_extra = NULL,
env_extra = NULL,
search_type = "bidirectional-forward",
search = "both",
search_criterion = "AIC",
forward_exclude_p_bigger = 0.2,
backward_exclude_p_smaller = 0.01,
exclude_worse_AIC = TRUE,
max_steps = 100,
cv_iter = 5,
cv_folds = 10,
folds = NULL,
Huber_p = 1.345,
classification = FALSE,
start_genes = NULL,
start_env = NULL,
eps = 0.01,
maxiter = 100,
family = gaussian,
ylim = NULL,
seed = NULL,
print = TRUE,
remove_miss = FALSE,
test_only = FALSE
)
Arguments
data |
data.frame of the dataset to be used. |
formula |
Model formula. Use E for the environmental score and G for the genetic score. Do not manually code interactions, write them in the formula instead (ex: G*E*z or G:E:z). |
interactive_mode |
If TRUE, uses interactive mode. In interactive mode, at each iteration, the user is shown the AIC, BIC, p-value and also the cross-validation |
genes_original |
data.frame of the variables inside the genetic score G (can be any sort of variable, doesn't even have to be genetic). |
env_original |
data.frame of the variables inside the environmental score E (can be any sort of variable, doesn't even have to be environmental). |
genes_extra |
data.frame of the additional variables to try including inside the genetic score G (can be any sort of variable, doesn't even have to be genetic). Set to NULL if using a backward search. |
env_extra |
data.frame of the variables to try including inside the environmental score E (can be any sort of variable, doesn't even have to be environmental). Set to NULL if using a backward search. |
search_type |
If |
search |
If |
search_criterion |
Criterion used to determine which variable is the best to add or worst to drop. If |
forward_exclude_p_bigger |
If p-value > |
backward_exclude_p_smaller |
If p-value < |
exclude_worse_AIC |
If AIC with variable > AIC without variable, we ignore the variable (Default = TRUE). This is an exclusion option whose purpose is to skip variables that are likely not worth looking at, which makes the algorithm faster, especially with cross-validation. Set to FALSE to prevent any exclusion here. |
max_steps |
Maximum number of steps taken (Default = 100). |
cv_iter |
Number of cross-validation iterations (Default = 5). |
cv_folds |
Number of cross-validation folds (Default = 10). Using |
folds |
Optional list of vectors containing the fold number for each observation. Bypasses cv_iter and cv_folds. Setting your own folds could be important for certain data types like time series or longitudinal data. |
Huber_p |
Parameter controlling the Huber cross-validation error (Default = 1.345). |
classification |
Set to TRUE if you are doing classification (binary outcome). |
start_genes |
Optional starting points for genetic score (must be the same length as the number of columns of |
start_env |
Optional starting points for environmental score (must be the same length as the number of columns of |
eps |
Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results). |
maxiter |
Maximum number of iterations. |
family |
Outcome distribution and link function (Default = gaussian). |
ylim |
Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (ex: [0,1] with binomial distribution). |
seed |
Seed for cross-validation folds. |
print |
If TRUE, prints all the steps and notes/warnings. Highly recommended unless you are batch running multiple stepwise searches (Default = TRUE). |
remove_miss |
If TRUE, remove missing data completely, otherwise missing data is only removed when adding or dropping a variable (Default = FALSE). |
test_only |
If TRUE, only the first fold is used for training and the other folds are predicted; the model is not trained on the other folds. Instead of cross-validation, this gives you a train/test split, and the test R-squared is returned as output. |
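For the folds argument described above, a minimal sketch of building subject-level folds for longitudinal data follows. It assumes the expected structure is one vector of fold numbers per cross-validation iteration, as the argument description suggests; the data, the helper name make_subject_folds, and the choice of 3 iterations and 5 folds are hypothetical.

```r
# Hypothetical longitudinal data: 20 subjects with 4 visits each
set.seed(123)
subject_id <- rep(1:20, each = 4)

# Hypothetical helper: all observations of a subject share the same fold
make_subject_folds <- function(id, k) {
  subjects <- unique(id)
  # Randomly assign each subject to one of k roughly equal-sized folds
  fold_of_subject <- sample(rep_len(seq_len(k), length(subjects)))
  # Map the subject-level fold back to every observation
  fold_of_subject[match(id, subjects)]
}

# One vector of fold numbers per cross-validation iteration (3 iterations here)
my_folds <- lapply(1:3, function(i) make_subject_folds(subject_id, k = 5))
str(my_folds)
```

Under that assumption, the list could then be supplied as folds = my_folds so that repeated measurements of one subject never appear in both the training and the held-out fold.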
Value
Returns an object of the class "LEGIT", which is a list containing, in the following order: a glm fit of the main model, a glm fit of the genetic score, a glm fit of the environmental score, and a list of the true model parameters (AIC, BIC, rank, df.residual, null.deviance), which the individual model parts (main, genetic, environmental) do not estimate properly.
Examples
## Not run:
## Continuous example
train = example_3way(250, 2.5, seed=777)
# Forward search for genes based on BIC (in interactive mode)
forward_genes_BIC = stepwise_search(train$data, genes_extra=train$G, env_original=train$E,
formula=y ~ E*G*z,search_type="forward", search="genes", search_criterion="BIC",
interactive_mode=TRUE)
# Bidirectional-backward search for environments based on cross-validation error
bidir_backward_env_cv = stepwise_search(train$data, genes_original=train$G, env_original=train$E,
formula=y ~ E*G*z,search_type="bidirectional-backward", search="env", search_criterion="cv")
## Binary example
train_bin = example_2way(500, 2.5, logit=TRUE, seed=777)
# Forward search for genes based on cross-validated AUC (in interactive mode)
forward_genes_AUC = stepwise_search(train_bin$data, genes_extra=train_bin$G,
env_original=train_bin$E, formula=y ~ E*G,search_type="forward", search="genes",
search_criterion="cv_AUC", classification=TRUE, family=binomial, interactive_mode=TRUE)
# Bidirectional-forward search for genes based on AIC
bidir_forward_genes_AIC = stepwise_search(train_bin$data, genes_extra=train_bin$G,
env_original=train_bin$E, formula=y ~ E*G,search_type="bidirectional-forward", search="genes",
search_criterion="AIC", classification=TRUE, family=binomial)
## End(Not run)