ExhaustiveSearch {ExhaustiveSearch} | R Documentation |
Exhaustive feature selection
Description
Performs an exhaustive feature selection. ExhaustiveSearch()
is a fast and
scalable implementation of an exhaustive feature selection framework. It is
particularly suited for huge tasks, which would typically not be possible due
to memory limitations. The current version allows to compute linear and
logistic regression models and compare them with respect to AIC or MSE.
Usage
ExhaustiveSearch(
formula,
data,
family = NULL,
performanceMeasure = NULL,
combsUpTo = NULL,
nResults = 5000,
nThreads = NULL,
testSetIDs = NULL,
errorVal = -1,
quietly = FALSE,
checkLarge = TRUE
)
Arguments
formula |
An object of class formula (or one that can be coerced to that class): a symbolic description of the feature and response structure. All combinations of features on the right hand side of the formula are evaluated. |
data |
A data.frame (or object coercible by |
family |
A character string naming the family function similar to the
parameter in |
performanceMeasure |
A character string naming the performance measure to compare models by. Currently available options are 'AIC' (Akaike's An Information Criterion) or 'MSE' (Mean Squared Error). |
combsUpTo |
An integer of length 1 to set an upper limit to the number of features in a combination. This can be useful to drastically reduce the total number of combinations to a feasible size. |
nResults |
An integer of length 1 to define the size of the final
ranking list. The default (5000) provides a good trade-off of memory usage
and result size. Set this value to |
nThreads |
Number of threads to use. The default is to detect the available number of threads automatically. |
testSetIDs |
A vector of row indices of data, which define the test set
partition. If this parameter is |
errorVal |
A numeric value defining what performance result is returned if the model could not be fitted. The default (-1) makes those models appear at the top of the result ranking. |
quietly |
logical. If set to |
checkLarge |
logical. Very large calls get stopped by a safety net. This parameter can be used to execute these calls anyway. |
Details
An exhaustive search evaluates all setups of a combinatorial task. In feature
and model selection application, exhaustive searches are often referred to as
optimal search strategies, as they test each setup and therefore ensure to
find the best solution. The main downside of this approach is the possibly
enormous computational complexity of the task. ExhaustiveSearch()
provides
an easy to use and efficient framework for these tasks.
Its main characteristics are:
Combinations are iteratively generated on the fly,
Model fitting and evalution is performed multi-threaded in C++,
Only a fixed amount of models are stored to keep memory usage small.
Therefore, the framework of this package is able to evaluate huge tasks of billions of models, while only being limited by run-time.
Currently, ordinary linear regression models similar to lm()
and logistic
regression models similar to glm()
(with parameter family = "binomial"
)
can be fitted. The model type is specified via the family
parameter. All
model results of the C++ backend are identical to what would be obtained by
glm()
or lm()
. For that, the logistic regression also uses the same
L-BFGS optimizer
as glm()
.
To assess the quality of a model, the performanceMeasure
options 'AIC'
(Akaike's An Information Criterion) and 'MSE' (Mean Squared Error) are
implemented. Note that the AIC can only be computed on the training data,
while it is recommended for the MSE to be computed on independent test data.
If performanceMeasure
is not set, it will be decided according to the
definition of a test data set.
While this framework is able to handle very large amounts of combinations, an
exhaustive search of every theoretical combination can still be unfeasible.
However, a possible way to drastically limit the total number of combinations
is to define an upper bound for the size of a combination. For example,
evaluating all combinations of 500 features (3.3e150) is obviously
impossible. But if we only consider combinations of up to 3 features, this
number reduces to around 21 million, which could easily be evaluated by this
framework in less than a minute (16 threads). Setting an upper limit is thus
a very powerful option to enable high dimensional analyses. It is implemented
by the parameter combsUpTo
.
A core element of why this framework does not require more memory if tasks
get larger is that at any point the best models are stored in a list of
fixed size. Therefore, sub-optimal models are not saved and do not take space
and time to be handled. The parameter defining the size of the models, which
are actively stored is nResults
. Large values here can impair performance
or even cause errors, if the system memory runs out and should always be set
with care. The function will however warn you beforehand if you set a very
large value here.
The parameter testSetIDs
can be used to split the data into a training and
testing partition. If it is not set, all models will be trained and tested on
the full data set. If it is set, the data will be split beforehand into
data[testSetIDs,]
and data[-testSetIDs,]
.
The development version of this package can be found at https://github.com/RudolfJagdhuber/ExhaustiveSearch. Issues or requests are handled on this page.
Value
Object of class ExhaustiveSearch
with elements
nModels |
The total number of evaluated models. |
runtimeSec |
The total runtime of the exhaustive search in seconds. |
ranking |
A list of the performance values and the featureIDs. The
i-th element of both correspond. The featureIDs refer to the elements of
|
featureNames |
The feature names in the given data.
|
batchInfo |
A list of information on the batches, into which the total task has been partitioned. List elements are the number of batches, the number of elements per batch, and the combination boundaries that define the batches. |
setup |
A list of input parameters from the function call. |
Author(s)
Rudolf Jagdhuber
See Also
Examples
## Linear Regression on mtcars data
data(mtcars)
## Exhaustive search of 1023 models compared by AIC
ES <- ExhaustiveSearch(mpg ~ ., data = mtcars, family = "gaussian",
performanceMeasure = "AIC")
print(ES)
## Same setup, but compared by MSE on a test set partition
testIDs <- sample(nrow(mtcars), round(1/3 * nrow(mtcars)))
ES2 <- ExhaustiveSearch(mpg ~ ., data = mtcars, family = "gaussian",
performanceMeasure = "MSE", testSetIDs = testIDs)
print(ES2)
## Not run:
## Logistic Regression on Ionosphere Data
data("Ionosphere", package = "mlbench")
## Only combinations of up to 3 features! -> 5488 models instead of 4 billion
ES3 <- ExhaustiveSearch((Class == "good") ~ ., data = Ionosphere[,-c(1, 2)],
family = "binomial", combsUpTo = 3)
print(ES3)
## End(Not run)