FRESA.CAD-package {FRESA.CAD}
FeatuRE Selection Algorithms for Computer-Aided Diagnosis (FRESA.CAD)
Description
Contains a set of utilities for building and testing formula-based models for Computer-Aided Diagnosis/prognosis applications via feature selection. Bootstrapped Stage Wise Model Selection (B:SWiMS) controls the false selection (FS) for linear, logistic, or Cox proportional hazards regression models. Utilities include functions for univariate/longitudinal analysis, data conditioning (i.e., covariate adjustment and normalization), model validation, and visualization.
Details
Package: | FRESA.CAD |
Type: | Package |
Version: | 3.4.8 |
Date: | 2024-06-25 |
License: | LGPL (>= 2) |
Purpose: The design of diagnostic or prognostic multivariate models via the selection of significantly discriminant features. Models are selected via the bootstrapped step-wise selection of model features that offer a significant improvement in subject classification/error. False-selection control is achieved by train-test partitions, where train sets are used to select variables and test sets are used to evaluate model performance. Variables that do not improve subject classification/error on the blind test are not included in the models.
The main function of this package is the selection and cross-validation of diagnostic/prognostic linear, logistic, or Cox proportional hazards regression models constructed from a large set of candidate features. The variable selection may start by conditioning all variables via covariate adjustment and a z-inverse-rank transformation. In order to integrate features with partial discriminant power, the package can be used to categorize the continuous variables and rank their discriminant power. Once ranked, each feature is bootstrap-tested in a multivariate model, and its blind performance is evaluated. Variables with a statistically significant improvement in classification/error are stored and finally inserted into the final model according to the frequency with which they were selected. A cross-validation procedure may be used to diagnose the amount of model shrinkage produced by the selection scheme.
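A minimal usage sketch (hedged): 'myData' and the outcome column 'Outcome' are placeholders, and the calls mirror those shown in the Examples section below.

    fit <- BSWiMS.model(formula = Outcome ~ 1, data = myData)  # bootstrapped stage-wise feature selection
    summary(fit$BSWiMS.model$back.model)                       # selected features and their coefficients
    cvmd <- FRESA.Model(formula = Outcome ~ 1, data = myData,
                        CVfolds = 10, repeats = 5)             # cross-validation to gauge model shrinkage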
Author(s)
Jose Gerardo Tamez-Pena, Antonio Martinez-Torteya, Israel Alanis and Jorge Orozco
Maintainer: Jose Gerardo Tamez-Pena <jose.tamezpena@tec.mx>
References
Pencina, M. J., D'Agostino, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine, 27(2), 157-172.
Examples
## Not run:
### Fresa Package Examples ####
library("epiR")
library("FRESA.CAD")
library(network)
library(GGally)
library("e1071")
# Start the graphics device driver to save all plots in a pdf format
pdf(file = "Fresa.Package.Example.pdf",width = 8, height = 6)
# Get the stage C prostate cancer data from the rpart package
data(stagec,package = "rpart")
options(na.action = 'na.pass')
dataCancer <- cbind(pgstat = stagec$pgstat,
pgtime = stagec$pgtime,
as.data.frame(model.matrix(Surv(pgtime,pgstat) ~ .,stagec))[-1])
#Impute missing values
dataCancerImputed <- nearestNeighborImpute(dataCancer)
# Remove the incomplete cases
dataCancer <- dataCancer[complete.cases(dataCancer),]
# Load a pre-established data frame with the names and descriptions of all variables
data(cancerVarNames)
# The Heat Map
hm <- heatMaps(cancerVarNames,varRank=NULL,Outcome="pgstat",
data=dataCancer,title="Heat Map",hCluster=FALSE,
prediction=NULL,Scale=TRUE,
theFiveColors=c("blue","cyan","black","yellow","red"),
outcomeColors=c("blue","lightgreen","yellow","orangered","red"),
transpose=FALSE,cexRow=0.50,cexCol=0.80,srtCol=35)
# The univariate analysis
UniRankFeaturesRaw <- univariateRankVariables(variableList = cancerVarNames,
formula = "pgstat ~ 1+pgtime",
Outcome = "pgstat",
data = dataCancer,
categorizationType = "Raw",
type = "LOGIT",
rankingTest = "zIDI",
description = "Description",
uniType="Binary")
print(UniRankFeaturesRaw)
# A simple B:SWiMS model
BSWiMSModel <- BSWiMS.model(formula = Surv(pgtime, pgstat) ~ 1, dataCancerImputed)
# The Log-Rank Analysis using survdiff
lrsurvdiff <- survdiff(Surv(pgtime,pgstat)~
BSWiMSModel$BSWiMS.model$back.model$linear.predictors > 0,
data=dataCancerImputed)
# The Log-Rank Analysis using EmpiricalSurvDiff with permutations of the null Chi distribution
lrp <- EmpiricalSurvDiff(dataCancerImputed$pgtime,dataCancerImputed$pgstat,
BSWiMSModel$BSWiMS.model$back.model$linear.predictors > 0,
type="Chi",plots=TRUE,samples=10000)
# The Log-Rank Analysis using EmpiricalSurvDiff with permutations of the null SLR distribution
lrp <- EmpiricalSurvDiff(dataCancerImputed$pgtime,dataCancerImputed$pgstat,
BSWiMSModel$BSWiMS.model$back.model$linear.predictors > 0,
type="SLR",plots=TRUE,samples=10000)
# The Log-Rank Analysis using EmpiricalSurvDiff with bootstrapping of the SLR distribution
lrp <- EmpiricalSurvDiff(dataCancerImputed$pgtime,dataCancerImputed$pgstat,
BSWiMSModel$BSWiMS.model$back.model$linear.predictors > 0,
computeDist=TRUE,plots=TRUE)
#The performance of the final model using the summary function
sm <- summary(BSWiMSModel$BSWiMS.model$back.model)
print(sm$coefficients)
pv <- plot(sm$bootstrap)
# The equivalent model
eq <- reportEquivalentVariables(BSWiMSModel$BSWiMS.model$back.model,data=dataCancer,
variableList=cancerVarNames,Outcome = "pgstat",
timeOutcome="pgtime",
type = "COX");
print(eq$equivalentMatrix)
#The list of all models of the bootstrap forward selection
print(BSWiMSModel$forward.selection.list)
#With FRESA.CAD we can do a leave-one-out using the list of models
pm <- ensemblePredict(BSWiMSModel$forward.selection.list,
dataCancer,predictType = "linear",type="LOGIT",Outcome="pgstat")
# Plotting the ROC with 95% confidence intervals
pm <- plotModels.ROC(cbind(dataCancer$pgstat,
pm$ensemblePredict),main=("LOO Forward Selection Median Predict"))
#The plotModels.ROC provides the diagnosis confusion matrix.
summary(epi.tests(pm$predictionTable))
#FRESA.CAD can be used to create a bagged model using the forward selection formulas
bagging <- baggedModel(BSWiMSModel$forward.selection.list,dataCancer,useFreq=32)
pm <- predict(bagging$bagged.model)
pm <- plotModels.ROC(cbind(dataCancer$pgstat,pm),main=("Bagged"))
#Let's check the performance of the model
sm <- summary(bagging$bagged.model)
print(sm$coefficients)
# Using the bagging object we can check the Jaccard index
print(bagging$Jaccard.SM)
# Plotting the evolution of the coefficient values
plot(bagging$coefEvolution$grade,main="Evolution of grade")
gplots::heatmap.2(bagging$formulaNetwork,trace="none",
mar=c(10,10),main="eB:SWIMS Formula Network")
barplot(bagging$frequencyTable,las = 2,cex.axis=1.0,
cex.names=0.75,main="Feature Frequency")
n <- network::network(bagging$formulaNetwork, directed = FALSE,
ignore.eval = FALSE,names.eval = "weights")
ggnet2(n, label = TRUE, size = "degree",size.cut = 3,size.min = 1,
mode = "circle",edge.label = "weights",edge.label.size=4)
# Get a Cox proportional hazards model using:
# - The default parameters
mdCOXs <- FRESA.Model(formula = Surv(pgtime, pgstat) ~ 1,data = dataCancer)
sm <- summary(mdCOXs$BSWiMS.model)
print(sm$coefficients)
# The model with significant improvement in the residual error
mdCOXs <- FRESA.Model(formula = Surv(pgtime, pgstat) ~ 1,
data = dataCancer,OptType = "Residual" )
sm <- summary(mdCOXs$BSWiMS.model)
print(sm$coefficients)
# Get a Cox proportional hazards model using second order models:
mdCOX <- FRESA.Model(formula = Surv(pgtime, pgstat) ~ 1,
data = dataCancer,categorizationType="RawRaw")
sm <- summary(mdCOX$BSWiMS.model)
print(sm$coefficients)
namesc <- names(mdCOX$BSWiMS.model$coefficients)[-1]
hm <- heatMaps(mdCOX$univariateAnalysis[namesc,],varRank=NULL,
Outcome="pgstat",data=dataCancer,
title="Heat Map",hCluster=FALSE,prediction=NULL,Scale=TRUE,
theFiveColors=c("blue","cyan","black","yellow","red"),
outcomeColors = c("blue","lightgreen","yellow","orangered","red"),
transpose=FALSE,cexRow=0.50,cexCol=0.80,srtCol=35)
# The LOO estimation
pm <- ensemblePredict(mdCOX$BSWiMS.models$formula.list,dataCancer,
predictType = "linear",type="LOGIT",Outcome="pgstat")
pm <- plotModels.ROC(cbind(dataCancer$pgstat,pm$ensemblePredict),main=("LOO Median Predict"))
#Let us check the diagnosis performance
summary(epi.tests(pm$predictionTable))
# Get a Logistic model using FRESA.Model
# - The default parameters
dataCancer2 <- dataCancer
dataCancer2$pgtime <- NULL
mdLOGIT <- FRESA.Model(formula = pgstat ~ 1,data = dataCancer2)
if (!is.null(mdLOGIT$bootstrappedModel)) pv <- plot(mdLOGIT$bootstrappedModel)
sm <- summary(mdLOGIT$BSWiMS.model)
print(sm$coefficients)
## FRESA.Model with Cross Validation and Recursive Partitioning and Regression Trees
md <- FRESA.Model(formula = Surv(pgtime, pgstat) ~ 1,data = dataCancer,
CVfolds = 10,repeats = 5,equivalent = TRUE,usrFitFun=rpart::rpart)
colnames(md$cvObject$Models.testPrediction)
pm <- plotModels.ROC(md$cvObject$LASSO.testPredictions,theCVfolds=10,main="CV LASSO",cex=0.90)
pm <- plotModels.ROC(md$cvObject$KNN.testPrediction,theCVfolds=10,main="KNN",cex=0.90)
pm <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds=10,
predictor="Prediction",main="B:SWiMS Bagging",cex=0.90)
pm <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds=10,
predictor="Ensemble.B.SWiMS",
main="Forward Selection Median Ensemble",cex=0.90)
pm <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds=10,
predictor="Ensemble.Forward",main="Forward Selection Bagging",cex=0.90)
pm <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds=10,
predictor="eB.SWiMS",main="Equivalent Model",cex=0.90)
pm <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds=10,
predictor="Forward.Selection.Bagged",main="The Forward Bagging",cex=0.90)
pm <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds=20,
predictor="usrFitFunction",
main="Recursive Partitioning and Regression Trees",cex=0.90)
pm <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds=20,
predictor="usrFitFunction_Sel",
main="Recursive Partitioning and Regression Trees with FS",cex=0.90)
## FRESA.Model with Cross Validation, LOGISTIC and Support Vector Machine
md <- FRESA.Model(formula = pgstat ~ 1,data = dataCancer2,
CVfolds = 10,repeats = 5,equivalent = TRUE,usrFitFun=svm)
pm <- plotModels.ROC(md$cvObject$LASSO.testPredictions,theCVfolds=10,main="CV LASSO",cex=0.90)
pm <- plotModels.ROC(md$cvObject$KNN.testPrediction,theCVfolds=10,main="KNN",cex=0.90)
pm <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds=10,
predictor="Prediction",main="B:SWiMS Bagging",cex=0.90)
md$cvObject$Models.testPrediction[,"usrFitFunction"] <-
md$cvObject$Models.testPrediction[,"usrFitFunction"] - 0.5
md$cvObject$Models.testPrediction[,"usrFitFunction_Sel"] <-
md$cvObject$Models.testPrediction[,"usrFitFunction_Sel"] - 0.5
pm <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds=10,
predictor="usrFitFunction",
main="SVM",cex = 0.90)
pm <- plotModels.ROC(md$cvObject$Models.testPrediction,theCVfolds=10,
predictor="usrFitFunction_Sel",
main="SVM with FS",cex = 0.90)
# Shut down the graphics device driver
dev.off()
## End(Not run)