R: Parallel genetic algorithm variable selection (for IMLEGIT)

genetic_var_select {LEGIT}

R Documentation

Description

[Very slow, recommended when the number of variables is large] Uses a standard genetic algorithm with single-point crossover and a single mutation, run in parallel, to find the best subset of variables. The percentage of times that each variable is included in the final populations is also given. This is very computationally demanding, but it finds much better solutions than either stepwise search or bootstrap variable selection.
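
To make the two operators named above concrete, here is a minimal base-R sketch of single-point crossover and a single mutation on 0/1 inclusion vectors. The function names `crossover_single_point` and `mutate_single` are hypothetical and are not part of LEGIT; the package's internal implementation may differ (for instance, this sketch does not check whether a mutation would leave a latent variable empty).

```r
# Illustrative only: a chromosome is a 0/1 vector, one entry per candidate variable.
set.seed(1)

# Single-point crossover: the child takes the first `cut` genes from parent a
# and the remaining genes from parent b.
crossover_single_point <- function(a, b, cut = sample(length(a) - 1, 1)) {
  c(a[seq_len(cut)], b[(cut + 1):length(b)])
}

# Single mutation: pick one gene at random and flip it with probability `prob`.
mutate_single <- function(x, prob = 0.5) {
  i <- sample(length(x), 1)
  if (runif(1) < prob) x[i] <- 1 - x[i]
  x
}

parent_a <- c(1, 1, 0, 0, 1, 0)
parent_b <- c(0, 1, 1, 0, 0, 1)

# With cut = 3 the child is parent_a[1:3] followed by parent_b[4:6].
child <- crossover_single_point(parent_a, parent_b, cut = 3)

# The mutant differs from the child in at most one position.
mutant <- mutate_single(child, prob = 0.5)
```

In this sketch, `prob` plays the role that `mutation_prob` plays in the function's arguments: one position is chosen, then flipped only with that probability.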

Usage

genetic_var_select(
  data,
  formula,
  parallel_iter = 10,
  entropy_threshold = 0.1,
  popsize = 25,
  mutation_prob = 0.5,
  first_pop = NULL,
  latent_var = NULL,
  search_criterion = "AIC",
  maxgen = 100,
  eps = 0.01,
  maxiter = 100,
  family = gaussian,
  ylim = NULL,
  seed = NULL,
  progress = TRUE,
  n_cluster = 1,
  best_subsets = 5,
  cv_iter = 5,
  cv_folds = 5,
  folds = NULL,
  Huber_p = 1.345,
  classification = FALSE,
  test_only = FALSE
)

Arguments

`data`	data.frame of the dataset to be used.
`formula`	Model formula. The names of `latent_var` can be used in the formula to represent the latent variables. If names(`latent_var`) is NULL, then L1, L2, ... can be used in the formula to represent the latent variables. Do not manually code interactions, write them in the formula instead (ex: G*E1*E2 or G:E1:E2).
`parallel_iter`	number of parallel genetic algorithms (Default = 10). I recommend using 2-4 times the number of CPU cores used.
`entropy_threshold`	Entropy threshold for convergence of the population (Default = .10). Not reaching the entropy threshold only means that the population still has some diversity, which is not necessarily a bad thing. Reaching the threshold is not required, but if a population reaches it, we want that population to stop reproducing (rather than continuing until `maxgen`) since future generations won't change much.
`popsize`	Size of the population (Default = 25). Between 25 and 100 is generally adequate.
`mutation_prob`	Probability of mutation (Default = .50). A single variable is selected for mutation and it is mutated with probability `mutation_prob`. If the mutation would cause a latent variable to become empty, no mutation is done. A small value (close to .05) makes the algorithm more likely to get stuck in suboptimal solutions, whereas a large value (close to 1) greatly increases the computing time because the population will have a hard time reaching the entropy threshold.
`first_pop`	Optional starting population, used instead of a fully random one. Mutation is also done on the initial population to increase variability.
`latent_var`	list of data.frame. The elements of the list are the datasets used to construct each latent variable. For interpretability and proper convergence, not using the same variable in more than one latent variable is highly recommended. Naming the list elements is recommended to prevent confusion; otherwise, the latent variables will be named L1, L2, ...
`search_criterion`	Criterion used to determine which variable is the best to add or worst to drop (Default = "AIC"). If `search_criterion="AIC"`, uses the AIC; if `search_criterion="AICc"`, uses the AICc; if `search_criterion="BIC"`, uses the BIC; if `search_criterion="cv"`, uses the cross-validation error; if `search_criterion="cv_AUC"`, uses the cross-validated AUC; if `search_criterion="cv_Huber"`, uses the Huber cross-validation error; if `search_criterion="cv_L1"`, uses the L1-norm cross-validation error. The Huber and L1-norm cross-validation errors are alternatives to the usual L2-norm cross-validation error (which the `R^2` is based on) that are more resistant to outliers. For these errors, lower values are better.
`maxgen`	Maximum number of generations (iterations) of the genetic algorithm (Default = 100). Between 50 and 200 generations is generally adequate.
`eps`	Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results). Note that using .001 rather than .01 (default) can more than double or triple the computing time of genetic_var_select.
`maxiter`	Maximum number of iterations.
`family`	Outcome distribution and link function (Default = gaussian).
`ylim`	Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (ex: [0,1] with binomial distribution).
`seed`	Optional seed.
`progress`	If TRUE, shows the progress done (Default=TRUE).
`n_cluster`	Number of parallel clusters (Default = 1). I recommend using the number of CPU cores minus 1.
`best_subsets`	If `best_subsets = k`, the output will show the k best subsets of variables (Default = 5).
`cv_iter`	Number of cross-validation iterations (Default = 5).
`cv_folds`	Number of cross-validation folds (Default = 5). Using `cv_folds=NROW(data)` will lead to leave-one-out cross-validation.
`folds`	Optional list of vectors containing the fold number for each observation. Bypasses `cv_iter` and `cv_folds`. Setting your own folds could be important for certain data types like time series or longitudinal data.
`Huber_p`	Parameter controlling the Huber cross-validation error (Default = 1.345).
`classification`	Set to TRUE if you are doing classification and cross-validation (binary outcome).
`test_only`	If TRUE, only uses the first fold for training and predicts the other folds; does not train on the other folds. So instead of cross-validation, this gives you a train/test split and you get the test R-squared as output.
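
As an illustration of the recommendations given for `n_cluster`, `parallel_iter` and `folds`, the following base-R sketch derives values for these arguments. It assumes that each element of the `folds` list holds one vector of fold numbers (one entry per observation) for one cross-validation iteration; this reading of the argument description is an assumption, as are the variable names `n_obs`, `k` and `block_folds`.

```r
# Number of clusters: CPU cores minus 1, but never less than 1.
# detectCores() can return NA on some platforms, so fall back to 1.
cores <- parallel::detectCores()
if (is.na(cores)) cores <- 1
n_cluster <- max(1, cores - 1)

# Number of parallel genetic algorithms: 2 to 4 times the cores used.
parallel_iter <- 2 * n_cluster

# Contiguous (block) folds for time-ordered data, so that neighbouring
# observations stay in the same fold instead of being split at random.
n_obs <- 20   # hypothetical number of observations, sorted by time
k <- 5        # number of folds
block_folds <- ceiling(seq_len(n_obs) * k / n_obs)

# One cross-validation iteration with user-defined folds (assumed format).
folds <- list(block_folds)
```

These objects could then be passed to `genetic_var_select` through the arguments of the same names, together with a cross-validation based `search_criterion` such as "cv".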

Value

Returns a list of vectors containing the percentage of times that each variable was included in the final populations, the criterion of the best k models, the starting points of the best k models (with the names of the best variables) and the entropy of the populations.
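
The two population summaries mentioned above can be illustrated on a toy final population. This is only a sketch: the page does not state the exact entropy formula used by the package, so the mean binary Shannon entropy used here is an assumption, and the matrix `pop` and the helper `binary_entropy` are hypothetical.

```r
# Hypothetical final population: rows are individuals, columns are variables,
# 1 means the variable is included in that individual's model.
pop <- rbind(
  c(1, 0, 1, 0),
  c(1, 0, 1, 1),
  c(1, 0, 0, 1),
  c(1, 0, 1, 0)
)
colnames(pop) <- c("x1", "x2", "x3", "x4")

# Percentage of individuals that include each variable.
inclusion_pct <- 100 * colMeans(pop)

# Binary Shannon entropy (in bits) of an inclusion frequency p;
# defined as 0 when p is exactly 0 or 1 (no diversity for that variable).
binary_entropy <- function(p) {
  h <- -(p * log2(p) + (1 - p) * log2(1 - p))
  h[p == 0 | p == 1] <- 0
  h
}

# Assumed population entropy: average entropy over all variables.
pop_entropy <- mean(binary_entropy(colMeans(pop)))
```

Under this assumed definition, a population in which every individual carries the same subset has entropy 0, which is the sense in which a low entropy signals that further generations are unlikely to change much.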

References

Zhu, M., & Chipman, H. (2006). Darwinian evolution in parallel universes: A parallel genetic algorithm for variable selection. Technometrics, 48(4), 491-502.

Examples

## Not run: 
## Example
train = example_3way_3latent(250, 2, seed=777)
# Genetic algorithm based on AIC
# Normally you should use a lot more than 2 populations with 10 generations
ga = genetic_var_select(train$data, latent_var=train$latent_var,
	formula=y ~ E*G*Z, search_criterion="AIC", parallel_iter=2, maxgen = 10)

## End(Not run)

[Package LEGIT version 1.4.1 Index]