genetic_var_select {LEGIT} | R Documentation |
Parallel genetic algorithm variable selection (for IMLEGIT)
Description
[Very slow, recommended when the number of variables is large] Uses a standard genetic algorithm with single-point crossover and a single mutation, run in parallel, to find the best subset of variables. The percentage of times that each variable is included in the final populations is also given. This is very computationally demanding, but it finds much better solutions than either stepwise search or bootstrap variable selection.
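The two genetic operators named above can be sketched in a few lines of base R. This is only an illustration, not the LEGIT implementation: the function names `single_point_crossover` and `single_mutation` are hypothetical, and the sketch assumes each candidate model is coded as a 0/1 vector (1 = variable included) and that the chosen position is flipped with probability `mutation_prob`.

```r
# Illustrative sketch only; not the code used inside genetic_var_select.
# A candidate model is a 0/1 vector: 1 means the variable is included.

# Single-point crossover: the child takes the first `point` genes from
# parent_a and the remaining genes from parent_b.
single_point_crossover <- function(parent_a, parent_b, point) {
  stopifnot(length(parent_a) == length(parent_b),
            point >= 1, point < length(parent_a))
  c(parent_a[seq_len(point)], parent_b[(point + 1):length(parent_b)])
}

# Single mutation: pick one position at random and flip it with
# probability `mutation_prob` (assumed behaviour).
single_mutation <- function(genes, mutation_prob = 0.5) {
  position <- sample.int(length(genes), 1)
  if (runif(1) < mutation_prob) {
    genes[position] <- 1 - genes[position]
  }
  genes
}

set.seed(1)
parent_a <- c(1, 1, 1, 1, 1, 1)
parent_b <- c(0, 0, 0, 0, 0, 0)
child <- single_point_crossover(parent_a, parent_b, point = 2)
mutant <- single_mutation(child, mutation_prob = 0.5)
```

Here `child` keeps the first two variables of `parent_a` and the last four of `parent_b`, and `mutant` differs from `child` in at most one position.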
Usage
genetic_var_select(
data,
formula,
parallel_iter = 10,
entropy_threshold = 0.1,
popsize = 25,
mutation_prob = 0.5,
first_pop = NULL,
latent_var = NULL,
search_criterion = "AIC",
maxgen = 100,
eps = 0.01,
maxiter = 100,
family = gaussian,
ylim = NULL,
seed = NULL,
progress = TRUE,
n_cluster = 1,
best_subsets = 5,
cv_iter = 5,
cv_folds = 5,
folds = NULL,
Huber_p = 1.345,
classification = FALSE,
test_only = FALSE
)
Arguments
data |
data.frame of the dataset to be used. |
formula |
Model formula. The names of |
parallel_iter |
number of parallel genetic algorithms (Default = 10). I recommend using 2-4 times the number of CPU cores used. |
entropy_threshold |
Entropy threshold for convergence of the population (Default = .10). Not reaching the entropy threshold only means that the population still has some diversity, which is not necessarily a bad thing. Reaching the threshold is not required, but if a population reaches it, we want it to stop reproducing (rather than continuing until |
popsize |
Size of the population (Default = 25). Between 25 and 100 is generally adequate. |
mutation_prob |
Probability of mutation (Default = .50). A single variable is selected for mutation and it is mutated with probability |
first_pop |
Optional starting population, used instead of a fully random one. Mutation is also applied to the starting population to increase variability. |
latent_var |
list of data.frame. The elements of the list are the datasets used to construct each latent variable. For interpretability and proper convergence, not using the same variable in more than one latent variable is highly recommended. It is recommended to set names to the list elements to prevent confusion because otherwise, the latent variables will be named L1, L2, ... |
search_criterion |
Criterion used to determine which variable is the best to add or worst to drop. If |
maxgen |
Maximum number of generations (iterations) of the genetic algorithm (Default = 100). Between 50 and 200 generations is generally adequate. |
eps |
Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results). Note that using .001 rather than .01 (default) can more than double or triple the computing time of genetic_var_select. |
maxiter |
Maximum number of iterations. |
family |
Outcome distribution and link function (Default = gaussian). |
ylim |
Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (ex: [0,1] with binomial distribution). |
seed |
Optional seed. |
progress |
If TRUE, shows the progress done (Default=TRUE). |
n_cluster |
Number of parallel clusters (Default = 1). I recommend using the number of CPU cores - 1. |
best_subsets |
If |
cv_iter |
Number of cross-validation iterations (Default = 5). |
cv_folds |
Number of cross-validation folds (Default = 5). Using |
folds |
Optional list of vectors containing the fold number for each observation. Bypasses cv_iter and cv_folds. Setting your own folds could be important for certain data types like time series or longitudinal data. |
Huber_p |
Parameter controlling the Huber cross-validation error (Default = 1.345). |
classification |
Set to TRUE if you are doing classification and cross-validation (binary outcome). |
test_only |
If TRUE, only uses the first fold for training and predicts the other folds; does not train on the other folds. So instead of cross-validation, this gives you train/test and you get the test R-squared as output. |
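For the folds argument, a minimal sketch of hand-built folds for time-ordered data is given below. The exact structure expected by genetic_var_select is an assumption here, read from the description "list of vectors containing the fold number for each observation" (one vector per cross-validation iteration); the names `n_obs`, `n_folds` and `block_folds` are illustrative only.

```r
# Illustrative sketch; the structure of `folds` is assumed from the
# argument description, not checked against the package source.
n_obs <- 250
n_folds <- 5

# Contiguous blocks: observations 1-50 form fold 1, 51-100 form fold 2,
# and so on, so that neighbouring time points stay in the same fold.
block_folds <- ceiling(seq_len(n_obs) / (n_obs / n_folds))

# One vector of fold numbers, wrapped in a list (a single CV iteration).
folds <- list(block_folds)
table(folds[[1]])
```

Blocked folds like these avoid mixing adjacent observations between training and validation sets, which random fold assignment would do.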
Value
Returns a list of vectors containing the percentage of times that each variable was included in the final populations, the criterion of the best k models, the starting points of the best k models (with the names of the best variables) and the entropy of the populations.
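As a rough illustration of how an inclusion percentage of this kind can be computed, the sketch below uses a made-up final population stored as a 0/1 matrix. The matrix, the variable names (`g1`, `g2`, `e1`) and the object names are hypothetical and do not reflect the internal representation or the element names of the returned list.

```r
# Hypothetical final population: each row is a retained model,
# each column a candidate variable (1 = included, 0 = excluded).
final_population <- matrix(
  c(1, 0, 1,
    1, 1, 0,
    1, 0, 0,
    1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("g1", "g2", "e1"))
)

# Percentage of models in which each variable appears.
inclusion_pct <- 100 * colMeans(final_population)
inclusion_pct
# g1 = 100, g2 = 25, e1 = 50
```

A variable that appears in most final models across the parallel populations is a stronger candidate than one selected only occasionally.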
References
Mu Zhu, & Hugh Chipman. Darwinian evolution in parallel universes: A parallel genetic algorithm for variable selection (2006). Technometrics, 48(4), 491-502.
Examples
## Not run:
## Example
train = example_3way_3latent(250, 2, seed=777)
# Genetic algorithm based on AIC
# Normally you should use a lot more than 2 populations with 10 generations
ga = genetic_var_select(train$data, latent_var=train$latent_var,
formula=y ~ E*G*Z, search_criterion="AIC", parallel_iter=2, maxgen = 10)
## End(Not run)