R: Selecting a subset of 'q' variables

selection {FWDselect}

R Documentation

Selecting a subset of `q` variables

Description

Main function for selecting the best subset of q variables. Note that the selection procedure can be used with lm, glm or gam functions.

Usage

selection(x, y, q, prevar = NULL, criterion = "deviance", method = "lm",
  family = "gaussian", seconds = FALSE, nmodels = 1, nfolds = 5,
  cluster = TRUE, ncores = NULL)

Arguments

`x`	A data frame containing all the covariates.
`y`	A vector with the response values.
`q`	An integer specifying the size of the subset of variables to be selected.
`prevar`	A vector containing the numbers of the variables in the best subset of size `q-1`. `NULL` by default.
`criterion`	The information criterion to be used. Default is the deviance (`"deviance"`). Other available criteria are the coefficient of determination (`"R2"`), the residual variance (`"variance"`), the Akaike information criterion (`"aic"`), AIC with a correction for finite sample sizes (`"aicc"`) and the Bayesian information criterion (`"bic"`). The deviance, coefficient of determination and variance are calculated by cross-validation.
`method`	A character string specifying which regression method is used, i.e., linear models (`"lm"`), generalized linear models (`"glm"`) or generalized additive models (`"gam"`).
`family`	A description of the error distribution and link function to be used in the model: `"gaussian"`, `"binomial"` or `"poisson"`.
`seconds`	A logical value, `FALSE` by default. If `TRUE`, the function returns, in addition to the single best model, a few of the best alternative (equivalent) models.
`nmodels`	Number of secondary models to be returned when `seconds = TRUE`.
`nfolds`	Number of folds for the cross-validation procedure, used with the `"deviance"`, `"R2"` or `"variance"` criterion.
`cluster`	A logical value. If `TRUE` (default), the procedure is parallelized. Note that in some cases with few repetitions (e.g., a low number of initial variables), serial computation will be faster. R needs time to distribute the tasks across the processors and to bind the results together afterwards. Therefore, if the time for distributing the tasks and gathering the pieces is greater than the time needed for single-threaded computing, parallelizing is not worthwhile.
`ncores`	An integer value specifying the number of cores to be used in the parallelized procedure. If `NULL` (default), the number of cores to be used is equal to the number of cores of the machine - 1.
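
The cross-validated criteria described under `criterion` and `nfolds` can be pictured with a minimal base-R sketch. It uses simulated data and a hypothetical helper, `cv_variance()`, that is not part of FWDselect; it only illustrates the idea of scoring each candidate variable by out-of-sample squared error, and the package's own computation may differ.

```r
# Sketch only: k-fold cross-validated residual variance for one-variable lm fits.
# cv_variance() is a hypothetical helper, not an FWDselect function.
set.seed(1)
n <- 100
x <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
y <- 2 * x$b + rnorm(n, sd = 0.5)   # only 'b' carries signal

cv_variance <- function(xj, y, nfolds = 5) {
  # Randomly assign each observation to one of nfolds folds
  fold <- sample(rep(seq_len(nfolds), length.out = length(y)))
  sq_err <- numeric(length(y))
  for (k in seq_len(nfolds)) {
    test <- fold == k
    train <- data.frame(xj = xj[!test], y = y[!test])
    fit <- lm(y ~ xj, data = train)          # fit on the training folds
    pred <- predict(fit, newdata = data.frame(xj = xj[test]))
    sq_err[test] <- (y[test] - pred)^2       # errors on the held-out fold
  }
  mean(sq_err)  # out-of-sample mean squared residual
}

# Score every column of x and keep the one with the smallest criterion
scores <- sapply(x, cv_variance, y = y, nfolds = 5)
best <- which.min(scores)
names(best)
```

With `q = 1`, choosing the column with the smallest score corresponds to selecting the single best variable under a variance-type criterion.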

Value

`Best model`	The best model. If `seconds=TRUE`, it also returns the best alternative models.
`Variable name`	Names of the selected variables.
`Variable number`	Numbers of the selected variables.
`Information criterion`	Information criterion used and its value.
`Prediction`	The prediction of the best model.
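
For the likelihood-based criteria (`"aic"`, `"aicc"`, `"bic"`), the reported value can be related to standard base-R quantities. The following is a sketch on simulated data, assuming the usual textbook definitions; it does not call FWDselect and is not guaranteed to match the package's internal computation.

```r
# Sketch only: AIC, small-sample corrected AIC (AICc) and BIC for an lm fit.
set.seed(2)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 3 * d$x1 + rnorm(n)

fit <- lm(y ~ x1, data = d)
k <- attr(logLik(fit), "df")                  # estimated parameters, incl. error variance
aic <- AIC(fit)                               # -2 * logLik + 2 * k
bic <- BIC(fit)                               # -2 * logLik + log(n) * k
aicc <- aic + 2 * k * (k + 1) / (n - k - 1)   # finite-sample correction
c(aic = aic, aicc = aicc, bic = bic)
```

Lower values indicate a better trade-off between fit and model size for all three criteria.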

Author(s)

Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.

Examples

library(FWDselect)
data(diabetes)
x = diabetes[ ,2:11]
y = diabetes[ ,1]
obj1 = selection(x, y, q = 1, method = "lm", criterion = "variance", cluster = FALSE)
obj1

# alternative (second-best) models
obj11 = selection(x, y, q = 1, method = "lm", criterion = "variance",
  seconds = TRUE, nmodels = 2, cluster = FALSE)
obj11

# prevar argument
obj2 = selection(x, y, q = 2, method = "lm", criterion = "variance", cluster = FALSE)
obj2
obj3 = selection(x, y, q = 3, prevar = obj2$Variable_numbers,
  method = "lm", criterion = "variance", cluster = FALSE)
obj3
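
The idea of starting from a previously selected subset, as the `prevar` argument does, can be illustrated with a simplified forward step in base R. The helper `forward_step()` and the simulated data are hypothetical; this is a sketch of a greedy step using AIC, not the algorithm implemented in FWDselect.

```r
# Sketch only: one greedy forward step starting from already selected columns.
# forward_step() is a hypothetical helper, not an FWDselect function.
set.seed(3)
n <- 120
x <- data.frame(v1 = rnorm(n), v2 = rnorm(n), v3 = rnorm(n), v4 = rnorm(n))
y <- 3 * x$v2 + 1.5 * x$v4 + rnorm(n, sd = 0.5)   # v2 and v4 carry signal

forward_step <- function(x, y, prev = NULL) {
  # Columns not yet selected are the candidates for this step
  candidates <- setdiff(seq_along(x), prev)
  aics <- sapply(candidates, function(j) {
    d <- data.frame(y = y, x[, c(prev, j), drop = FALSE])
    AIC(lm(y ~ ., data = d))
  })
  # Keep the previous columns and add the candidate with the lowest AIC
  c(prev, candidates[which.min(aics)])
}

sel1 <- forward_step(x, y)                # best single variable
sel2 <- forward_step(x, y, prev = sel1)   # reuse sel1, add one more
names(x)[sel2]
```

Reusing `sel1` in the second call avoids repeating the search for the first variable, which is the same motivation as passing `obj2$Variable_numbers` to `prevar` above.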

[Package FWDselect version 2.1.0 Index]