crossValidationFeatureSelection_Bin {FRESA.CAD}    R Documentation
IDI/NRI-based selection of a linear, logistic, or Cox proportional hazards regression model from a set of candidate variables
Description
This function performs a cross-validation analysis of a feature selection algorithm based on the integrated discrimination improvement (IDI) or the net reclassification improvement (NRI) to return a predictive model. It is composed of an IDI/NRI-based feature selection followed by an update procedure, ending with a bootstrapped backwards feature elimination. The user can control how many train and blind test sets will be evaluated.
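The IDI criterion that drives the selection can be illustrated with a short base-R sketch (simulated data; the variable names x1 and x2 are hypothetical, and this is not the package's internal code): adding an informative candidate variable should raise the mean predicted risk for events and lower it for non-events.

```r
# Illustration of the IDI selection criterion (not FRESA.CAD internals):
# compare the predicted risks of a logistic model before and after
# adding a simulated candidate variable x2.
set.seed(42)
n  <- 500
x1 <- rnorm(n)                           # variable already in the model
x2 <- rnorm(n)                           # candidate variable under test
y  <- rbinom(n, 1, plogis(x1 + 2 * x2))  # outcome truly depends on x2

p.old <- fitted(glm(y ~ x1,      family = binomial))
p.new <- fitted(glm(y ~ x1 + x2, family = binomial))

# IDI: gain in mean predicted risk for events minus the (unwanted)
# gain in mean predicted risk for non-events
idi <- (mean(p.new[y == 1]) - mean(p.old[y == 1])) -
       (mean(p.new[y == 0]) - mean(p.old[y == 0]))
idi  # positive when x2 improves discrimination
```

The function tests the z-score of such improvements (selectionType = "zIDI" or "zNRI"), keeping only terms whose associated p-value stays below pvalue.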
Usage
crossValidationFeatureSelection_Bin(size = 10,
fraction = 1.0,
pvalue = 0.05,
loops = 100,
covariates = "1",
Outcome,
timeOutcome = "Time",
variableList,
data,
maxTrainModelSize = 20,
type = c("LM", "LOGIT", "COX"),
selectionType = c("zIDI", "zNRI"),
startOffset = 0,
elimination.bootstrap.steps = 100,
trainFraction = 0.67,
trainRepetition = 9,
bootstrap.steps = 100,
nk = 0,
unirank = NULL,
print=TRUE,
plots=TRUE,
lambda="lambda.1se",
equivalent=FALSE,
bswimsCycles=10,
usrFitFun=NULL,
featureSize=0)
Arguments
size
The number of candidate variables to be tested (the first size variables from variableList)
fraction
The fraction of data (sampled with replacement) to be used as train
pvalue
The maximum p-value, associated with either the IDI or the NRI, allowed for a term in the model
loops
The number of bootstrap loops
covariates
A string of the type "1 + var1 + var2" that defines which variables will always be included in the models (as covariates)
Outcome
The name of the column in data that stores the variable to be predicted by the model
timeOutcome
The name of the column in data that stores the time to event (needed only for a Cox proportional hazards regression model fitting)
variableList
A data frame with two columns. The first one must have the names of the candidate variables and the other one the description of such variables
data
A data frame where all variables are stored in different columns
maxTrainModelSize
Maximum number of terms that can be included in the model
type
Fit type: logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")
selectionType
The type of index to be evaluated by the improveProb function (Hmisc package): z-score of the IDI ("zIDI") or of the NRI ("zNRI")
startOffset
Only terms whose position in the model is larger than the startOffset are candidates to be removed
elimination.bootstrap.steps
The number of bootstrap loops for the backwards elimination procedure
trainFraction
The fraction of data (sampled with replacement) to be used as train for the cross-validation procedure
trainRepetition
The number of cross-validation folds (it should be at least equal to 1/trainFraction for a complete cross-validation)
bootstrap.steps
The number of bootstrap loops for the confidence intervals estimation
nk
The number of neighbours used to generate a k-nearest neighbours (KNN) classification. If zero, k is set to the square root of the number of cases. If less than zero, the KNN classification is not performed
unirank
A list with the results yielded by the uniRankVar function, required only if the rank needs to be updated during the cross-validation procedure
print
Logical. If TRUE, information will be displayed
plots
Logical. If TRUE, plots will be displayed
lambda
The value passed to the s parameter when extracting the coefficients of the cross-validated glmnet fit (e.g. "lambda.1se" or "lambda.min")
equivalent
If set to TRUE, the cross-validation procedure will compute the equivalent model
bswimsCycles
The maximum number of models to be returned by BSWiMS.model
usrFitFun
A user-supplied fitting function to be evaluated by the cross-validation procedure
featureSize
The original number of features to be explored in the data frame
Details
This function produces a set of data and plots that can be used to inspect the degree of over-fitting or shrinkage of a model. It uses bootstrapped data, cross-validation data, and, if possible, retrain data. During each cycle, a train and a test ROC will be generated using bootstrapped data. At the end of the cross-validation feature selection procedure, a set of three plots may be produced depending on the specifications of the analysis. The first plot shows the ROC for each cross-validation blind test. The second plot, if enough samples are given, shows the ROC of each model trained and tested in the blind test partition. The final plot shows the ROC curves generated with the train, the bootstrapped blind test, and the cross-validation test data; it also contains the ROC of the cross-validation mean test data and of the cross-validation coherence. This set of plots may be used to get an overall perspective of the expected model shrinkage. Along with the plots, the function reports the overall performance of the system (accuracy, sensitivity, and specificity). The function also produces a report of the expected performance of a KNN algorithm trained with the selected features of the model, and of an elastic net algorithm. The test predictions obtained with these algorithms can then be compared to the predictions generated by the logistic, linear, or Cox proportional hazards regression model.
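The overall performance figures the function reports can be reproduced from pooled blind test predictions with the standard definitions; a small sketch with hypothetical predicted scores and a 0.5 cut-off:

```r
# Accuracy, sensitivity, and specificity from pooled CV test predictions.
# The outcome and score vectors below are hypothetical.
outcome <- c(1, 1, 1, 0, 0, 0, 1, 0)
pred    <- c(0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.8, 0.3)  # pooled test scores
cls     <- as.integer(pred >= 0.5)                    # 0.5 decision cut-off

tp <- sum(cls == 1 & outcome == 1)      # true positives
tn <- sum(cls == 0 & outcome == 0)      # true negatives
sensitivity <- tp / sum(outcome == 1)   # fraction of events detected
specificity <- tn / sum(outcome == 0)   # fraction of non-events detected
accuracy    <- (tp + tn) / length(outcome)
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```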
Value
formula.list
A list containing objects of class formula with the formulas of the models found at each cross-validation fold
Models.testPrediction
A data frame with the blind test set predictions (Full B:SWiMS, Median, Bagged, Forward, Backwards Eliminations) made at each fold of the cross-validation, where the models used to generate such predictions (formula.list) were fitted on the corresponding train set
FullBSWiMS.testPrediction
A data frame similar to Models.testPrediction, but where the model used to generate the predictions was the Full model, i.e. the model fitted with all the data
TestRetrained.blindPredictions
A data frame similar to Models.testPrediction, but containing the predictions of the models retrained and tested in the blind test partition (produced only if enough samples are given)
LastTrainBSWiMS.bootstrapped
An object of class bootstrapValidation_Bin containing the results of the bootstrap validation of the last trained model
Test.accuracy
The global blind test accuracy of the cross-validation procedure
Test.sensitivity
The global blind test sensitivity of the cross-validation procedure
Test.specificity
The global blind test specificity of the cross-validation procedure
Train.correlationsToFull
The Spearman rank correlation coefficient between the predictions made with each model from formula.list and the Full model in the train sets
Blind.correlationsToFull
The Spearman rank correlation coefficient between the predictions made with each model from formula.list and the Full model in the blind test sets
FullModelAtFoldAccuracies
The blind test accuracy for the Full model at each cross-validation fold
FullModelAtFoldSpecificties
The blind test specificity for the Full model at each cross-validation fold
FullModelAtFoldSensitivities
The blind test sensitivity for the Full model at each cross-validation fold
FullModelAtFoldAUC
The blind test ROC AUC for the Full model at each cross-validation fold
AtCVFoldModelBlindAccuracies
The blind test accuracy of the model fitted at each final cross-validation fold
AtCVFoldModelBlindSpecificities
The blind test specificity of the model fitted at each final cross-validation fold
AtCVFoldModelBlindSensitivities
The blind test sensitivity of the model fitted at each final cross-validation fold
CVTrain.Accuracies
The train accuracies at each fold
CVTrain.Sensitivity
The train sensitivity at each fold
CVTrain.Specificity
The train specificity at each fold
CVTrain.AUCs
The train ROC AUC for each fold
forwardSelection
A list containing the values returned by ForwardSelection.Model.Bin using all the data
updateforwardSelection
A list containing the values returned by updateModel.Bin using all the data and the model from forwardSelection
BSWiMS
A list containing the values returned by bootstrapVarElimination_Bin using all the data and the model from updateforwardSelection
FullBSWiMS.bootstrapped
An object of class bootstrapValidation_Bin containing the results of the bootstrap validation of the Full model
Models.testSensitivities
A matrix with the mean ROC sensitivities at certain specificities (i.e. 0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, and 0.05) for each train and all test cross-validation folds, using the cross-validation models
FullKNN.testPrediction
A data frame similar to Models.testPrediction, but containing the predictions of a KNN classifier trained with the features of the Full model
KNN.testPrediction
A data frame similar to Models.testPrediction, but containing the predictions of a KNN classifier trained with the features selected at each cross-validation fold
Fullenet
An object of class cv.glmnet containing the elastic net model fitted with all the data
LASSO.testPredictions
A data frame similar to Models.testPrediction, but containing the predictions of the elastic net models fitted at each cross-validation fold
LASSOVariables
A list with the elastic net Full model and the models found at each cross-validation fold
uniTrain.Accuracies
The list of accuracies of a univariate analysis on each one of the model variables in the train sets
uniTest.Accuracies
The list of accuracies of a univariate analysis on each one of the model variables in the test sets
uniTest.TopCoherence
The accuracy coherence of the top-ranked variable on the test set
uniTrain.TopCoherence
The accuracy coherence of the top-ranked variable on the train set
Models.trainPrediction
A data frame with the outcome and the train prediction of every model
FullBSWiMS.trainPrediction
A data frame with the outcome and the train prediction at each CV fold for the main model
LASSO.trainPredictions
A data frame with the outcome and the prediction of each elastic net LASSO model
BSWiMS.ensemble.prediction
The ensemble prediction by all models on the test data
AtOptFormulas.list
The list of formulas with "optimal" performance
ForwardFormulas.list
The list of formulas produced by the forward procedure
baggFormulas.list
The list of the bagged models
LassoFilterVarList
The list of variables used by the LASSO fitting
Author(s)
Jose G. Tamez-Pena and Antonio Martinez-Torteya
References
Pencina, M. J., D'Agostino, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine, 27(2), 157-172.
See Also
crossValidationFeatureSelection_Res,
ForwardSelection.Model.Bin,
ForwardSelection.Model.Res