ForwardSelection.Model.Bin {FRESA.CAD}

R Documentation

IDI/NRI-based feature selection procedure for linear, logistic, and Cox proportional hazards regression models

Description

This function uses bootstrap sampling to rank the variables that statistically improve prediction. In each bootstrap loop, the IDI/NRI of every candidate variable is computed, and the variable with the largest statistically significant integrated discrimination improvement (IDI) or net reclassification improvement (NRI) is added to the model. This step is repeated within the loop until no more variables can be inserted. The variables that entered the model are then counted, and the same procedure is repeated for the remaining bootstrap loops. After the frequency ranking, the function uses a forward selection procedure to create a final model whose terms all contribute significantly to the IDI or NRI. The function returns the frequency of variable inclusion as well as the model built from that frequency ranking.
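The selection criterion can be illustrated with a minimal base R sketch of the IDI and its z-score, following the definitions in Pencina et al. (2008). This is not the package's internal code: the simulated data and the names `x1`, `x2`, and `idi_z` are assumptions made for illustration only.

```r
# Sketch: IDI and its z-score when one candidate (x2) is added to a
# logistic model that already contains x1. All names are illustrative.
set.seed(1)
n   <- 500
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 + 0.8 * x2))
dat <- data.frame(y, x1, x2)

idi_z <- function(p_old, p_new, y) {
  d    <- p_new - p_old   # paired change in predicted probability
  d_ev <- d[y == 1]       # subjects with the event
  d_ne <- d[y == 0]       # subjects without the event
  idi  <- mean(d_ev) - mean(d_ne)
  # Standard error from the paired differences in each group
  se   <- sqrt(var(d_ev) / length(d_ev) + var(d_ne) / length(d_ne))
  c(idi = idi, z = idi / se)
}

base_fit <- glm(y ~ x1,      family = binomial, data = dat)
full_fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

res <- idi_z(fitted(base_fit), fitted(full_fit), dat$y)
p_one_sided <- pnorm(res[["z"]], lower.tail = FALSE)
print(res)
print(p_one_sided)
```

In a forward step, a statistic of this kind would be computed for every remaining candidate, and the candidate with the largest z-score whose p-value does not exceed `pvalue` would be added.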

Usage

	ForwardSelection.Model.Bin(size = 100,
	                            fraction = 1,
	                            pvalue = 0.05,
	                            loops = 100,
	                            covariates = "1",
	                            Outcome,
	                            variableList,
	                            data,
	                            maxTrainModelSize = 20,
	                            type = c("LM", "LOGIT", "COX"),
	                            timeOutcome = "Time",
	                            selectionType = c("zIDI", "zNRI"),
	                            cores = 6,
	                            randsize = 0,
	                            featureSize = 0)

Arguments

`size`	The number of candidate variables to be tested (the first `size` variables from `variableList`)
`fraction`	The fraction of the data (sampled with replacement) to be used for training
`pvalue`	The maximum p-value, associated with either IDI or NRI, allowed for a term in the model
`loops`	The number of bootstrap loops
`covariates`	A string of the type "1 + var1 + var2" that defines which variables will always be included in the models (as covariates)
`Outcome`	The name of the column in `data` that stores the variable to be predicted by the model
`variableList`	A data frame with two columns. The first must contain the names of the candidate variables and the second the description of each variable
`data`	A data frame where all variables are stored in different columns
`maxTrainModelSize`	Maximum number of terms that can be included in the model
`type`	Fit type: Logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")
`timeOutcome`	The name of the column in `data` that stores the time to event (needed only for a Cox proportional hazards regression model fitting)
`selectionType`	The type of index to be evaluated by the `improveProb` function (`Hmisc` package): z-score of IDI or of NRI
`cores`	Cores to be used for parallel processing
`randsize`	The model size of a random outcome. If `randsize` is less than zero, the function will estimate the size
`featureSize`	The original number of features to be explored in the data frame
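The arguments above could be assembled as in the following sketch. The data are simulated, and the names `toy`, `varList`, `status`, and `f1` to `f5` are hypothetical. The call itself only runs when FRESA.CAD is installed, and the reduced `loops` and `cores` values are chosen just to keep the example light.

```r
# Sketch: building the inputs described above from simulated data.
set.seed(42)
n    <- 200
feat <- as.data.frame(matrix(rnorm(n * 5), ncol = 5,
                             dimnames = list(NULL, paste0("f", 1:5))))
# Binary outcome that depends on f1 and f2 only
toy  <- cbind(status = rbinom(n, 1, plogis(feat$f1 - feat$f2)), feat)

# Two columns: candidate names first, then their descriptions
varList <- data.frame(Name = colnames(feat),
                      Description = paste("Simulated feature", 1:5),
                      stringsAsFactors = FALSE)

if (requireNamespace("FRESA.CAD", quietly = TRUE)) {
  fit <- FRESA.CAD::ForwardSelection.Model.Bin(
    size = nrow(varList), fraction = 1, pvalue = 0.05, loops = 20,
    covariates = "1", Outcome = "status", variableList = varList,
    data = toy, maxTrainModelSize = 5, type = "LOGIT",
    selectionType = "zIDI", cores = 1)
  print(fit$var.names)   # features in the final model
  print(fit$ranked.var)  # ranked inclusion frequencies
}
```

For a Cox model, `type = "COX"` would be used together with `timeOutcome` set to the name of the time-to-event column.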

Value

`final.model`	An object of class `lm`, `glm`, or `coxph` containing the final model
`var.names`	A vector with the names of the features that were included in the final model
`formula`	An object of class `formula` with the formula used to fit the final model
`ranked.var`	An array with the ranked frequencies of the features
`z.selection`	A vector in which each element is the z-score of the index defined in `selectionType`, obtained by comparing the full model with the model without that term
`formula.list`	A list containing objects of class `formula` with the formulas used to fit the models found at each cycle
`variableList`	A list of variables used in the forward selection
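As a rough illustration of a ranked inclusion frequency such as `ranked.var`, the sketch below tallies hypothetical per-loop selections in base R. Whether the package reports raw counts or proportions is not stated here; the normalization by the number of loops is an assumption, as is the `selected` list.

```r
# Sketch: tallying how often each feature was selected across loops.
# `selected` is a made-up stand-in for one selection per bootstrap loop.
selected <- list(c("f1", "f2"), c("f1"), c("f1", "f3"), c("f2", "f1"))

counts <- table(unlist(selected))
# Assumption: frequencies are expressed as a fraction of the loops
ranked <- sort(counts / length(selected), decreasing = TRUE)
print(ranked)
```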

Author(s)

Jose G. Tamez-Pena and Antonio Martinez-Torteya

References

Pencina, M. J., D'Agostino, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine, 27(2), 157-172.