
crossValidationFeatureSelection_Res {FRESA.CAD}

R Documentation

NeRI-based selection of a linear, logistic, or Cox proportional hazards regression model from a set of candidate variables

Description

This function performs a cross-validation analysis of a feature selection algorithm based on net residual improvement (NeRI) to return a predictive model. It is composed of a NeRI-based feature selection followed by an update procedure, ending with a bootstrapping backwards feature elimination. The user can control how many train and blind test sets will be evaluated.
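As a rough illustration of the idea behind a net residual improvement, the sketch below compares the absolute residuals of two nested linear models fitted to simulated data. The definition used here (the fraction of observations whose residual shrinks minus the fraction whose residual grows, tested with a one-sided sign test) is an assumption made for illustration. It uses base R only and is not the package's own implementation; all variable names are hypothetical.

```r
# Hypothetical illustration with simulated data (base R only).
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 1.5 * x2 + rnorm(n)

base_fit <- lm(y ~ x1)        # model without the candidate term
new_fit  <- lm(y ~ x1 + x2)   # model with the candidate term

old_res <- abs(residuals(base_fit))
new_res <- abs(residuals(new_fit))

improved <- sum(new_res < old_res)   # observations predicted better
worsened <- sum(new_res > old_res)   # observations predicted worse

# Assumed definition: net fraction of observations whose residual shrank.
neri <- (improved - worsened) / n

# One-sided sign (binomial) test: are improvements more frequent than chance?
sign_test   <- binom.test(improved, improved + worsened,
                          p = 0.5, alternative = "greater")
neri_pvalue <- sign_test$p.value
```

Under this reading, a candidate term would be kept only if `neri_pvalue` falls below the chosen `pvalue` threshold.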

Usage

	crossValidationFeatureSelection_Res(size = 10,
	                                    fraction = 1.0,
	                                    pvalue = 0.05,
	                                    loops = 100,
	                                    covariates = "1",
	                                    Outcome,
	                                    timeOutcome = "Time",
	                                    variableList,
	                                    data,
	                                    maxTrainModelSize = 20,
	                                    type = c("LM", "LOGIT", "COX"),
	                                    testType = c("Binomial",
	                                                 "Wilcox",
	                                                 "tStudent",
	                                                 "Ftest"),
	                                    startOffset = 0,
	                                    elimination.bootstrap.steps = 100,
	                                    trainFraction = 0.67,
	                                    trainRepetition = 9,
	                                    setIntersect = 1,
	                                    unirank = NULL,
	                                    print=TRUE,
	                                    plots=TRUE,
	                                    lambda="lambda.1se",
	                                    equivalent=FALSE,
	                                    bswimsCycles=10,
	                                    usrFitFun=NULL,
	                                    featureSize=0)

Arguments

`size`	The number of candidate variables to be tested (the first `size` variables from `variableList`)
`fraction`	The fraction of data (sampled with replacement) to be used for training
`pvalue`	The maximum p-value, associated with the NeRI, allowed for a term in the model
`loops`	The number of bootstrap loops
`covariates`	A string of the type "1 + var1 + var2" that defines which variables will always be included in the models (as covariates)
`Outcome`	The name of the column in `data` that stores the variable to be predicted by the model
`timeOutcome`	The name of the column in `data` that stores the time to event (needed only for a Cox proportional hazards regression model fitting)
`variableList`	A data frame with two columns. The first one must have the names of the candidate variables and the other one the description of such variables
`data`	A data frame where all variables are stored in different columns
`maxTrainModelSize`	Maximum number of terms that can be included in the model
`type`	Fit type: Logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")
`testType`	Type of statistical test to be evaluated by the `improvedResiduals` function: Binomial test ("Binomial"), Wilcoxon rank-sum test ("Wilcox"), Student's t-test ("tStudent"), or F-test ("Ftest")
`startOffset`	Only terms whose position in the model is larger than the `startOffset` are candidates to be removed
`elimination.bootstrap.steps`	The number of bootstrap loops for the backwards elimination procedure
`trainFraction`	The fraction of data (sampled with replacement) to be used for training in the cross-validation procedure
`setIntersect`	The intercept of the model (to force a zero intercept, set this value to 0)
`trainRepetition`	The number of cross-validation folds (it should be at least equal to `1/trainFraction` for a complete cross-validation)
`unirank`	A list with the results yielded by the `uniRankVar` function, required only if the rank needs to be updated during the cross-validation procedure
`print`	Logical. If `TRUE`, information will be displayed
`plots`	Logical. If `TRUE`, plots are displayed
`lambda`	The value passed to the `s` parameter when extracting the coefficients of the glmnet cross-validation fit
`equivalent`	Logical. If set to `TRUE`, the cross-validation will compute the equivalent model
`bswimsCycles`	The maximum number of models to be returned by `BSWiMS.model`
`usrFitFun`	A user fitting function to be evaluated by the cross validation procedure
`featureSize`	The original number of features to be explored in the data frame.
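The following sketch shows one way the main inputs described above might be prepared. The simulated data, the column names (`y`, `age`, `var1`, ...), and the `run_fresa` switch are hypothetical. The call itself uses only arguments listed in the Usage section, but it is left disabled and has not been run here.

```r
# Hypothetical inputs built from simulated data.
set.seed(1)
n <- 120
data <- data.frame(
  y    = rnorm(n),
  age  = rnorm(n, mean = 50, sd = 10),
  var1 = rnorm(n),
  var2 = rnorm(n),
  var3 = rnorm(n)
)
data$y <- data$y + 0.8 * data$var1

# Two columns: candidate variable names first, then their descriptions.
variableList <- data.frame(
  Name        = c("var1", "var2", "var3"),
  Description = c("First marker", "Second marker", "Third marker"),
  stringsAsFactors = FALSE
)

# Terms that are always kept in the models.
covariates <- "1 + age"

# Rule stated for trainRepetition: at least 1/trainFraction repetitions.
trainFraction  <- 0.67
minRepetitions <- ceiling(1 / trainFraction)

# Disabled by default: the call needs FRESA.CAD and can take a while.
run_fresa <- FALSE
if (run_fresa && requireNamespace("FRESA.CAD", quietly = TRUE)) {
  cv <- FRESA.CAD::crossValidationFeatureSelection_Res(
    size = nrow(variableList),
    fraction = 1.0,
    pvalue = 0.05,
    loops = 100,
    covariates = covariates,
    Outcome = "y",
    variableList = variableList,
    data = data,
    type = "LM",
    testType = "Binomial",
    trainFraction = trainFraction,
    trainRepetition = 9,
    print = FALSE,
    plots = FALSE
  )
}
```

Setting `run_fresa <- TRUE` would attempt the cross-validation with a linear fit and the binomial test. For a Cox fit, `timeOutcome` would also have to name a time-to-event column in `data`.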

Details

This function produces a set of data and plots that can be used to inspect the degree of over-fitting or shrinkage of a model. It uses bootstrapped data, cross-validation data, and, if possible, retrain data.
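To make the notion of over-fitting concrete, the sketch below repeats a random train/test split on simulated data and compares the train RMSE with the blind test RMSE of an ordinary linear model. It is a base R illustration under stated assumptions, not the package's procedure: the data are simulated, all names are hypothetical, and the train set is sampled without replacement for simplicity.

```r
# Hypothetical data: 30 candidate predictors, only V1 is informative.
set.seed(7)
n <- 150
p <- 30
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
X$y <- 2 * X$V1 + rnorm(n)

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

trainFraction <- 0.67
repetitions   <- 9
trainRMSE <- numeric(repetitions)
testRMSE  <- numeric(repetitions)

for (i in seq_len(repetitions)) {
  # Random split (without replacement, a simplification)
  train_idx <- sample(n, size = round(trainFraction * n))
  train <- X[train_idx, ]
  test  <- X[-train_idx, ]

  # Deliberately over-parameterized model: all predictors are used
  fit <- lm(y ~ ., data = train)

  trainRMSE[i] <- rmse(train$y, predict(fit, newdata = train))
  testRMSE[i]  <- rmse(test$y,  predict(fit, newdata = test))
}

# A positive gap suggests that the model fits the train data too closely.
overfit_gap <- mean(testRMSE) - mean(trainRMSE)
```

A feature selection step that discards the uninformative predictors would be expected to shrink this gap.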

Value

`formula.list`	A list containing objects of class `formula` with the formulas used to fit the models found at each cycle
`Models.testPrediction`	A data frame with the blind test set predictions made at each fold of the cross-validation (Full B:SWiMS, Median, Bagged, Forward, Backward Elimination), where the models used to generate such predictions (`formula.list`) were generated via a feature selection process which included only the train set. It also includes a column with the `Outcome` of each prediction, and a column with the number of the fold at which the prediction was made.
`FullBSWiMS.testPrediction`	A data frame similar to `Models.testPrediction`, but where the model used to generate the predictions was the Full model, generated via a feature selection process which included all data.
`BSWiMS`	A list containing the values returned by `bootstrapVarElimination_Res` using all data and the model from `updatedforwardModel`
`forwardSelection`	A list containing the values returned by `ForwardSelection.Model.Res` using all data
`updatedforwardModel`	A list containing the values returned by `updateModel.Res` using all data and the model from `forwardSelection`
`testRMSE`	The global blind test root-mean-square error (RMSE) of the cross-validation procedure
`testPearson`	The global blind test Pearson r product-moment correlation coefficient of the cross-validation procedure
`testSpearman`	The global blind test Spearman ρ rank correlation coefficient of the cross-validation procedure
`FulltestRMSE`	The global blind test RMSE of the Full model
`FullTestPearson`	The global blind test Pearson r product-moment correlation coefficient of the Full model
`FullTestSpearman`	The global blind test Spearman ρ rank correlation coefficient of the Full model
`trainRMSE`	The train RMSE at each fold of the cross-validation procedure
`trainPearson`	The train Pearson r product-moment correlation coefficient at each fold of the cross-validation procedure
`trainSpearman`	The train Spearman ρ rank correlation coefficient at each fold of the cross-validation procedure
`FullTrainRMSE`	The train RMSE of the Full model at each fold of the cross-validation procedure
`FullTrainPearson`	The train Pearson r product-moment correlation coefficient of the Full model at each fold of the cross-validation procedure
`FullTrainSpearman`	The train Spearman ρ rank correlation coefficient of the Full model at each fold of the cross-validation procedure
`testRMSEAtFold`	The blind test RMSE at each fold of the cross-validation procedure
`FullTestRMSEAtFold`	The blind test RMSE of the Full model at each fold of the cross-validation procedure
`Fullenet`	An object of class `cv.glmnet` containing the results of an elastic net cross-validation fit
`LASSO.testPredictions`	A data frame similar to `Models.testPrediction`, but where the predictions were made by the elastic net model
`LASSOVariables`	A list with the elastic net Full model and the models found at each cross-validation fold
`byFoldTestMS`	A vector with the mean squared error for each blind fold
`byFoldTestSpearman`	A vector with the Spearman correlation between prediction and outcome for each blind fold
`byFoldTestPearson`	A vector with the Pearson correlation between prediction and outcome for each blind fold
`byFoldCstat`	A vector with the C-index (Somers' Dxy rank correlation: `rcorr.cens`) between prediction and outcome for each blind fold
`CVBlindPearson`	A vector with the Pearson correlation between the outcome and prediction for each repeated experiment
`CVBlindSpearman`	A vector with the Spearman correlation between the outcome and prediction for each repeated experiment
`CVBlindRMS`	A vector with the RMS error between the outcome and prediction for each repeated experiment
`Models.trainPrediction`	A data frame with the outcome and the train prediction of every model
`FullBSWiMS.trainPrediction`	A data frame with the outcome and the train prediction at each CV fold for the main model
`LASSO.trainPredictions`	A data frame with the outcome and the prediction of each elastic net model
`uniTrainMSS`	A data frame with the mean square of the train residuals from the univariate models of the model terms
`uniTestMSS`	A data frame with the mean square of the test residuals from the univariate models of the model terms
`BSWiMS.ensemble.prediction`	The ensemble prediction by all models on the test data
`AtOptFormulas.list`	The list of formulas with "optimal" performance
`ForwardFormulas.list`	The list of formulas produced by the forward procedure
`baggFormulas.list`	The list of the bagged models
`LassoFilterVarList`	The list of variables used by LASSO fitting
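The per-fold and global summaries listed above can be understood through a small base R sketch. It builds a mock prediction table and computes the RMSE and the Pearson and Spearman correlations, both by fold and over all blind predictions. The column names (`Outcome`, `fold`, `Prediction`) and the simulated values are assumptions; the layout of the data frames actually returned by the function may differ.

```r
# Mock blind-test predictions: 9 folds with 10 predictions each (hypothetical).
set.seed(3)
pred <- data.frame(
  Outcome = rnorm(90),
  fold    = rep(1:9, each = 10)
)
pred$Prediction <- pred$Outcome + rnorm(90, sd = 0.5)

# One data frame per fold
by_fold <- split(pred, pred$fold)

byFoldRMSE <- sapply(by_fold, function(d)
  sqrt(mean((d$Outcome - d$Prediction)^2)))
byFoldPearson <- sapply(by_fold, function(d)
  cor(d$Outcome, d$Prediction, method = "pearson"))
byFoldSpearman <- sapply(by_fold, function(d)
  cor(d$Outcome, d$Prediction, method = "spearman"))

# Global summaries pool the blind predictions of all folds.
globalRMSE     <- sqrt(mean((pred$Outcome - pred$Prediction)^2))
globalPearson  <- cor(pred$Outcome, pred$Prediction, method = "pearson")
globalSpearman <- cor(pred$Outcome, pred$Prediction, method = "spearman")
```

With equally sized folds, the global mean squared error equals the average of the per-fold mean squared errors, whereas the pooled correlations are generally not simple averages of the per-fold correlations.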

Author(s)

Jose G. Tamez-Pena and Antonio Martinez-Torteya