FRESA.Model {FRESA.CAD}    R Documentation
Automated model selection
Description
This function uses a wrapper procedure to select the features of a non-penalized linear model that best predict the outcome, given the formula of an initial model template (linear, logistic, or Cox proportional hazards), an optimization procedure, and a data frame. A filter scheme may be enabled to reduce the search space of the wrapper procedure. The false selection rate may be empirically controlled by enabling bootstrapping, and model shrinkage can be evaluated by cross-validation.
Usage
FRESA.Model(formula,
data,
OptType = c("Binary", "Residual"),
pvalue = 0.05,
filter.p.value = 0.10,
loops = 32,
maxTrainModelSize = 20,
elimination.bootstrap.steps = 100,
bootstrap.steps = 100,
print = FALSE,
plots = FALSE,
CVfolds = 1,
repeats = 1,
nk = 0,
categorizationType = c("Raw",
"Categorical",
"ZCategorical",
"RawZCategorical",
"RawTail",
"RawZTail",
"Tail",
"RawRaw"),
cateGroups = c(0.1, 0.9),
raw.dataFrame = NULL,
var.description = NULL,
testType = c("zIDI",
"zNRI",
"Binomial",
"Wilcox",
"tStudent",
"Ftest"),
lambda="lambda.1se",
equivalent=FALSE,
bswimsCycles=20,
usrFitFun=NULL
)
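A minimal invocation sketch (not taken from the package examples; myData is a hypothetical data frame whose binary 0/1 column outcome is to be predicted from the remaining columns, and all other arguments are left at their defaults):

library(FRESA.CAD)

# Intercept-only template; the wrapper searches the remaining columns of myData
md <- FRESA.Model(formula = outcome ~ 1,
                  data = myData,
                  OptType = "Binary")

sm <- summary(md$BSWiMS.model)   # coefficients of the selected model
print(sm$coefficients)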
Arguments
formula
An object of class formula with the formula to be fitted (the initial model template)
data
A data frame where all variables are stored in different columns
OptType
Optimization type: based on the integrated discrimination improvement (IDI) index for binary classification ("Binary"), or based on the net residual improvement (NeRI) index for linear regression ("Residual")
pvalue
The maximum p-value, associated to the testType, allowed for a term in the model (it controls the false selection rate)
filter.p.value
The maximum p-value for a variable to be included in the feature selection procedure
loops
The number of bootstrap loops for the forward selection procedure
maxTrainModelSize
Maximum number of terms that can be included in the model
elimination.bootstrap.steps
The number of bootstrap loops for the backward elimination procedure
bootstrap.steps
The number of bootstrap loops for the bootstrap validation procedure
print
Logical. If TRUE, information will be displayed
plots
Logical. If TRUE, plots are displayed
CVfolds
The number of folds for the final cross-validation
repeats
The number of times that the cross-validation procedure will be repeated
nk
The number of neighbors used to generate a k-nearest neighbors (KNN) classification. If zero, k is set to the square root of the number of cases. If less than zero, the KNN classification is not performed
categorizationType
How variables will be analyzed: as given in data ("Raw"), or recoded into the percentile categories defined by cateGroups, optionally weighted by the z-score and/or combined with the raw values or the distribution tails (the remaining options)
cateGroups
A vector of percentiles to be used for the categorization procedure
raw.dataFrame
A data frame similar to data, but with unadjusted values, used to obtain the means and variances of the unadjusted data
var.description
A vector of the same length as the number of columns of data, containing a description of the variables
testType
For an IDI/NRI-based optimization, the type of index to be evaluated by the improveProb function (Hmisc package): z-score of the IDI ("zIDI") or of the NRI ("zNRI"). For a NeRI-based optimization, the type of non-parametric test to be evaluated by the improvedResiduals function: binomial test ("Binomial"), Wilcoxon rank-sum test ("Wilcox"), Student's t-test ("tStudent"), or F-test ("Ftest")
lambda |
The passed value to the s parameter of the glmnet cross validation coefficient |
equivalent |
Is set to TRUE CV will compute the equivalent model |
bswimsCycles |
The maximum number of models to be returned by |
usrFitFun |
An optional user provided fitting function to be evaluated by the cross validation procedure: fitting: usrFitFun(formula,data), with a predict function |
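Regarding usrFitFun, a hedged sketch of a user-provided fitting function (the name myFitFun is hypothetical; it assumes the cross-validation procedure calls usrFitFun(formula, data) and then uses the standard predict method of the returned object):

# A user fitting function must accept a formula and a data frame and return
# an object that predict() can handle; a plain logistic glm satisfies this.
myFitFun <- function(formula, data)
{
    fit <- glm(formula, data = data, family = binomial(link = "logit"))
    return(fit)
}

# Hypothetical use:
# md <- FRESA.Model(formula = outcome ~ 1, data = myData, usrFitFun = myFitFun)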
Details
This function of FRESA.CAD fits models and, optionally, cross-validates them. Given an outcome formula and a data frame, it performs a univariate analysis of the data (univariateRankVariables), selects the top-ranked variables, and then selects the model that best describes the outcome. The output includes the bootstrapped performance of the model (bootstrapValidation_Bin or bootstrapValidation_Res). The function can also be set to report the cross-validation performance of the selection process, in which case it returns either a crossValidationFeatureSelection_Bin or a crossValidationFeatureSelection_Res object.
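A hedged sketch of the two reporting modes described above (myData and its outcome column are hypothetical; component names follow the Value section below):

# Default: bootstrapped performance of the selected model
md <- FRESA.Model(formula = outcome ~ 1, data = myData)
pt <- plot(md$bootstrappedModel)          # bootstrapValidation_Bin/_Res object

# Cross-validated selection: set CVfolds > 1 (and optionally repeats > 1)
mdcv <- FRESA.Model(formula = outcome ~ 1, data = myData,
                    CVfolds = 5, repeats = 3)
str(mdcv$cvObject, max.level = 1)         # crossValidationFeatureSelection_* object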
Value
BSWiMS.model
An object of class lm, glm, or coxph containing the final fitted model
reducedModel
The resulting object of the backward elimination procedure
univariateAnalysis
A data frame with the results from the univariate analysis
forwardModel
The resulting object of the feature selection function
updatedforwardModel
The resulting object of the update procedure
bootstrappedModel
The resulting object of the bootstrap procedure on the forward selection model
cvObject
The resulting object of the cross-validation procedure
used.variables
The number of terms that passed the filter procedure
call
The function call
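A short sketch of inspecting the remaining components (md as returned by any of the calls above; component names as listed in this section):

md$used.variables                 # number of terms that passed the filter
head(md$univariateAnalysis)       # top rows of the univariate analysis
md$call                           # the call that produced the object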
Author(s)
Jose G. Tamez-Pena and Antonio Martinez-Torteya
References
Pencina, M. J., D'Agostino, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in medicine 27(2), 157-172.
Examples
## Not run:
# Start the graphics device driver to save all plots in a pdf format
pdf(file = "FRESA.Model.Example.pdf",width = 8, height = 6)
# Get the stage C prostate cancer data from the rpart package
data(stagec,package = "rpart")
options(na.action = 'na.pass')
stagec_mat <- cbind(pgstat = stagec$pgstat,
pgtime = stagec$pgtime,
as.data.frame(model.matrix(Surv(pgtime,pgstat) ~ .,stagec))[-1])
data(cancerVarNames)
dataCancerImputed <- nearestNeighborImpute(stagec_mat)
# Get a Cox proportional hazards model using:
# - The default parameters
md <- FRESA.Model(formula = Surv(pgtime, pgstat) ~ 1,
data = dataCancerImputed,
var.description = cancerVarNames[,2])
pt <- plot(md$bootstrappedModel)
sm <- summary(md$BSWiMS.model)
print(sm$coefficients)
# Get a 10-fold CV Cox proportional hazards model using:
# - Repeating the CV 10 times
md <- FRESA.Model(formula = Surv(pgtime, pgstat) ~ 1,
data = dataCancerImputed, CVfolds = 10,
repeats = 10,
var.description = cancerVarNames[,2])
pt <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds = 10)
print(pt$predictionTable)
pt <- plotModels.ROC(md$cvObject$LASSO.testPredictions,theCVfolds = 10)
pt <- plotModels.ROC(md$cvObject$KNN.testPrediction,theCVfolds = 10)
# Get a regression of the survival time
timeSubjects <- dataCancerImputed
timeSubjects$pgtime <- log(timeSubjects$pgtime)
md <- FRESA.Model(formula = pgtime ~ 1,
data = timeSubjects,
var.description = cancerVarNames[,2])
pt <- plot(md$bootstrappedModel)
sm <- summary(md$BSWiMS.model)
print(sm$coefficients)
# Get a logistic regression model using:
# - The default parameters and removing time as a possible predictor
dataCancerImputed$pgtime <- NULL
md <- FRESA.Model(formula = pgstat ~ 1,
data = dataCancerImputed,
var.description = cancerVarNames[,2])
pt <- plot(md$bootstrappedModel)
sm <- summary(md$BSWiMS.model)
print(sm$coefficients)
# Get a logistic regression model using:
# - residual-based optimization
md <- FRESA.Model(formula = pgstat ~ 1,
data = dataCancerImputed,
OptType = "Residual",
var.description = cancerVarNames[,2])
pt <- plot(md$bootstrappedModel)
sm <- summary(md$BSWiMS.model)
print(sm$coefficients)
# Shut down the graphics device driver
dev.off()
## End(Not run)