stepwise_search_IM {LEGIT} | R Documentation |
Stepwise search for the best subset of elements in the latent variables with the IMLEGIT model
Description
[Fast, recommended when the number of variables is small] Adds the best variable or drops the worst variable, one at a time, in the latent variables. You can select the search criterion (AIC, BIC, cross-validation error, cross-validation AUC) used to determine which variable is the best or worst and should be added or dropped. Note that when the number of variables in G and E is large, this does not generally converge to the optimal subset; this function is only recommended when you have a small number of variables (e.g. 2 environments, 6 genetic variants). If using cross-validation (search_criterion="cv" or search_criterion="cv_AUC"), to avoid cross-validating with each variable (extremely slow), we recommend setting the p-value thresholds (forward_exclude_p_bigger, backward_exclude_p_smaller) and forcing the algorithm not to look at models with a bigger AIC (exclude_worse_AIC=TRUE).
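The add-one-variable-at-a-time idea described above can be sketched with base R alone. This is a toy illustration of a greedy forward search by AIC on an ordinary linear model, not the IMLEGIT implementation; the helper name forward_aic and the simulated data are assumptions made for this example.

```r
# Toy data (hypothetical): y depends on x1 and x2; x3 and x4 are pure noise.
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

# Greedy forward search: at each step, add the candidate that lowers AIC the most.
# forward_aic is a hypothetical helper, not part of LEGIT.
forward_aic <- function(data, outcome, candidates, max_steps = 100) {
  selected <- character(0)
  best_aic <- AIC(lm(reformulate("1", response = outcome), data = data))
  for (step in seq_len(max_steps)) {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # AIC of every model obtained by adding one remaining candidate
    aics <- sapply(remaining, function(v) {
      AIC(lm(reformulate(c(selected, v), response = outcome), data = data))
    })
    if (min(aics) >= best_aic) break  # no candidate improves the criterion
    selected <- c(selected, remaining[which.min(aics)])
    best_aic <- min(aics)
  }
  list(selected = selected, aic = best_aic)
}

null_aic <- AIC(lm(y ~ 1, data = d))
fit <- forward_aic(d, "y", c("x1", "x2", "x3", "x4"))
fit$selected
```

Because each step only compares models that differ by one variable, the search is fast but can miss the best subset when many candidates are available, which is why the function is recommended for a small number of variables.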
Usage
stepwise_search_IM(
data,
formula,
interactive_mode = FALSE,
latent_var_original = NULL,
latent_var_extra = NULL,
search_type = "bidirectional-forward",
search = 0,
search_criterion = "AIC",
forward_exclude_p_bigger = 0.2,
backward_exclude_p_smaller = 0.01,
exclude_worse_AIC = TRUE,
max_steps = 100,
cv_iter = 5,
cv_folds = 10,
folds = NULL,
Huber_p = 1.345,
classification = FALSE,
start_latent_var = NULL,
eps = 0.01,
maxiter = 100,
family = gaussian,
ylim = NULL,
seed = NULL,
print = TRUE,
remove_miss = FALSE,
test_only = FALSE
)
Arguments
data |
data.frame of the dataset to be used. |
formula |
Model formula. The names of |
interactive_mode |
If TRUE, uses interactive mode. In interactive mode, at each iteration, the user is shown the AIC, BIC, p-value and also the cross-validation |
latent_var_original |
list of data.frame. The elements of the list are the datasets used to construct each latent variable. For interpretability and proper convergence, not using the same variable in more than one latent variable is highly recommended. It is recommended to set names to the list elements to prevent confusion because otherwise, the latent variables will be named L1, L2, ... |
latent_var_extra |
list of data.frame (with the same structure as latent_var_original) containing the additional elements to try including inside the latent variables. Set to NULL if using a backward search. |
search_type |
If |
search |
If |
search_criterion |
Criterion used to determine which variable is the best to add or worst to drop. If |
forward_exclude_p_bigger |
If p-value > |
backward_exclude_p_smaller |
If p-value < |
exclude_worse_AIC |
If AIC with variable > AIC without variable, we ignore the variable (Default = TRUE). This is an exclusion option whose purpose is to skip variables that are likely not worth considering, which makes the algorithm faster, especially with cross-validation. Set to FALSE to prevent any exclusion here. |
max_steps |
Maximum number of steps taken (Default = 100). |
cv_iter |
Number of cross-validation iterations (Default = 5). |
cv_folds |
Number of cross-validation folds (Default = 10). Using |
folds |
Optional list of vectors containing the fold number for each observation. Bypasses cv_iter and cv_folds. Setting your own folds could be important for certain data types like time series or longitudinal data. |
Huber_p |
Parameter controlling the Huber cross-validation error (Default = 1.345). |
classification |
Set to TRUE if you are doing classification (binary outcome). |
start_latent_var |
Optional list of starting points for each latent variable (The list must have the same length as the number of latent variables and each element of the list must have the same length as the number of variables of the corresponding latent variable). |
eps |
Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results). |
maxiter |
Maximum number of iterations. |
family |
Outcome distribution and link function (Default = gaussian). |
ylim |
Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (e.g. [0,1] with a binomial distribution). |
seed |
Seed for cross-validation folds. |
print |
If TRUE, prints all the steps and notes/warnings. Highly recommended unless you are batch running multiple stepwise searches (Default = TRUE). |
remove_miss |
If TRUE, remove missing data completely, otherwise missing data is only removed when adding or dropping a variable (Default = FALSE). |
test_only |
If TRUE, only uses the first fold for training and predicts the other folds; does not train on the other folds. So instead of cross-validation, this gives you a train/test split and you get the test R-squared as output. |
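For the folds argument, a minimal sketch of building custom folds for longitudinal data is shown here. It assumes that each list element is one cross-validation iteration holding a fold number per observation, which is one reading of the description above; the helper make_subject_folds and the subject layout are hypothetical.

```r
# Hypothetical longitudinal layout: 20 subjects with 5 visits each (100 rows).
n_subjects <- 20
n_visits <- 5
subject_id <- rep(seq_len(n_subjects), each = n_visits)

# Assign whole subjects (not single rows) to folds so that repeated
# measures of one subject never sit in both training and test sets.
# make_subject_folds is a hypothetical helper, not part of LEGIT.
make_subject_folds <- function(subject_id, k, seed) {
  set.seed(seed)
  subjects <- unique(subject_id)
  subject_fold <- sample(rep(seq_len(k), length.out = length(subjects)))
  subject_fold[match(subject_id, subjects)]
}

# One fold vector per cross-validation iteration (3 iterations, 5 folds each)
folds <- lapply(1:3, function(i) make_subject_folds(subject_id, k = 5, seed = i))
table(folds[[1]])
```

A list built this way could then be passed as folds = folds, in which case cv_iter and cv_folds are bypassed.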
Value
Returns an object of the class "IMLEGIT", which is a list containing, in the following order: a glm fit of the main model, a list of the glm fits of the latent variables, and a list of the true model parameters (AIC, BIC, rank, df.residual, null.deviance), which the individual model parts (main, genetic, environmental) do not estimate properly.
Examples
## Not run:
## Example
train = example_3way_3latent(250, 1, seed=777)
# Forward search for genes based on BIC (in interactive mode)
forward_genes_BIC = stepwise_search_IM(
  train$data,
  latent_var_original = list(G = NULL, E = train$latent_var$E, Z = train$latent_var$Z),
  latent_var_extra = list(G = train$latent_var$G, E = NULL, Z = NULL),
  formula = y ~ E*G*Z,
  search_type = "forward", search = 1, search_criterion = "BIC",
  interactive_mode = TRUE)
# Bidirectional-backward search for everything based on AIC
bidir_backward_AIC = stepwise_search_IM(
  train$data,
  latent_var_original = train$latent_var,
  latent_var_extra = NULL,
  formula = y ~ E*G*Z,
  search_type = "bidirectional-backward", search = 0, search_criterion = "AIC")
## End(Not run)
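A cross-validated search following the recommendation in the Description might look like the sketch below. It only uses arguments listed under Usage and the example data generator from the examples above; the specific values (cv_iter = 1, cv_folds = 5, seed = 777) are illustrative assumptions, and the call runs only if the LEGIT package is installed. It can be slow.

```r
# Settings named after the arguments in the Usage section; values are illustrative.
cv_settings <- list(
  search_type = "bidirectional-backward",
  search = 0,
  search_criterion = "cv",
  forward_exclude_p_bigger = 0.2,    # skip additions with p-value above this
  backward_exclude_p_smaller = 0.01, # skip removals with p-value below this
  exclude_worse_AIC = TRUE,          # skip models with a bigger AIC
  cv_iter = 1,
  cv_folds = 5,
  seed = 777,
  print = FALSE
)

# Runs only when LEGIT is available; otherwise the settings are just defined.
if (requireNamespace("LEGIT", quietly = TRUE)) {
  train <- LEGIT::example_3way_3latent(250, 1, seed = 777)
  cv_fit <- do.call(LEGIT::stepwise_search_IM, c(
    list(data = train$data,
         formula = y ~ E*G*Z,
         latent_var_original = train$latent_var,
         latent_var_extra = NULL),
    cv_settings
  ))
}
```

Keeping the exclusion settings in one list makes it easy to relax them (for example exclude_worse_AIC = FALSE) and compare run times.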