autoModel {Conigrave}		R Documentation



Description

autoModel uses a genetic algorithm to optimize regression models for increased explained variance. Overly complicated models are penalized for each additional regression term in order to combat over-fitting.


Usage

autoModel(data, outcome, genepool = NULL, extinction = 30,
  children = 20, penalty = 0.03, samples = 5, include = c(),
  exclude = c(), set.seed = NULL)



Arguments

data: a data.frame or imputationList.

outcome: the column name of the dependent variable.

genepool: a vector. The genepool is the vector of variable names used to generate models. If not set, the genepool defaults to all variables in the supplied dataset other than the outcome variable.

extinction: a numeric. The algorithm will stop when no improvement has been made for this number of generations.

children: a numeric. The number of models to test in each generation.

penalty: a numeric. Model fitness will be reduced by this number for each regression coefficient, resulting in a handicap for overly complicated models.

samples: a numeric. The number of sub-samples in which to test the stability of r-squared.

include: a vector of column names which must be included as predictors in each model.

exclude: a vector of column names to be removed from the genepool.

set.seed: a numeric. If provided, the algorithm will use this seed so that results are reproducible.


Details

'autoModel' is a genetic algorithm which mutates regression models (predicting a specified outcome) in order to maximize r-squared (the explained variance).

The algorithm tests models at random. In each generation, it produces 'children' using the current best model as a seed. Each child of the previous winner will, on average, lose one predictor and gain one predictor. In each child, predictors also have a smaller chance of gaining or losing an interaction term. Over successive generations, selecting seeds with larger r-squared values drives a drift towards models which explain more variance.
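The per-child mutation step can be pictured as follows. This is an illustrative toy, not the package's internal code: the helper name and the lose/gain probabilities are assumptions, and interaction terms are omitted for brevity.

```r
# Toy sketch of one mutation: a child of the seed model may lose one
# predictor and may gain one predictor drawn from the genepool.
mutate_child <- function(seed_terms, genepool, p_lose = 0.5, p_gain = 0.5) {
  child <- seed_terms
  if (length(child) > 0 && runif(1) < p_lose) {
    child <- child[-sample(length(child), 1)]   # lose a random predictor
  }
  candidates <- setdiff(genepool, child)
  if (length(candidates) > 0 && runif(1) < p_gain) {
    child <- c(child, sample(candidates, 1))    # gain a random predictor
  }
  child
}

set.seed(2)
mutate_child(c("wt", "hp"), setdiff(names(mtcars), "mpg"))
```

Generating many such children per generation and keeping the fittest one as the next seed is the core of the genetic search.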

Without intervention, this algorithm generates very complicated models, e.g. 15-way interactions, in which all variance is explained. Such overly-complicated models are almost certainly useless for explaining phenomena outside of the training dataset. Generally, these models do no more than describe the exact configuration of the dataset in which they evolved. To deal with this, models are penalized for every predictor. This means that increased complexity will not be preferred unless it contributes substantially to the model's r-squared.
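The handicap amounts to subtracting the penalty once per regression term; a minimal sketch of that idea (the package's exact fitness bookkeeping may differ):

```r
# Sketch: fitness is r-squared minus a fixed penalty per regression term.
penalized_fitness <- function(r_squared, n_terms, penalty = 0.03) {
  r_squared - penalty * n_terms
}

# With the default penalty of 0.03, a 10-term model must out-explain a
# 3-term model by more than 0.21 r-squared in order to win:
penalized_fitness(0.85, 3)   # 0.76
penalized_fitness(0.87, 10)  # 0.57
```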

When the algorithm has failed to improve model fitness over many successive generations, it stops and returns the best model, along with the history of all previous winners. The algorithm tests the stability of each of these winners on multiple sub-samples (75% of rows, sampled with replacement). Stability is equal to 1, minus the standard deviation of the r-squares across the sub-samples, divided by the r-squared of the model in question. Stability can range from 1 down to negative values (when the standard deviation of sub-sample r-squares is larger than the model's r-squared).
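The stability statistic described above reduces to a one-line formula; a minimal sketch, assuming the sub-sample r-squares are supplied as a numeric vector:

```r
# Stability = 1 - sd(sub-sample r-squares) / model r-squared.
stability <- function(subsample_r2, model_r2) {
  1 - sd(subsample_r2) / model_r2
}

# Tightly clustered sub-sample r-squares give stability close to 1:
stability(c(0.70, 0.72, 0.71, 0.69, 0.73), 0.71)
# Wildly varying sub-sample r-squares can push stability below zero:
stability(c(0.05, 0.90, 0.10, 0.85), 0.20)
```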


Value

A list containing a tibble with all the best models the algorithm found, the summary results of the best model, and a plot tracking the algorithm's performance.


Examples

autoModel(mtcars, "mpg", set.seed = 2)

[Package Conigrave version 0.4.4 Index]