ExplanatoryPerformance {sharp} | R Documentation |
Prediction performance in regression
Description
Calculates model performance for linear (measured by Q-squared), logistic
(AUC) or Cox (C-statistic) regression. This is done by (i) refitting the
model on a training set including a proportion tau of the
observations, and (ii) evaluating the performance on the remaining
observations (test set). For more reliable results, the procedure can be
repeated K times (default K=1).
Usage
ExplanatoryPerformance(
xdata,
ydata,
new_xdata = NULL,
new_ydata = NULL,
stability = NULL,
family = NULL,
implementation = NULL,
prediction = NULL,
resampling = "subsampling",
K = 1,
tau = 0.8,
seed = 1,
n_thr = NULL,
time = 1000,
verbose = FALSE,
...
)
Arguments
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
new_xdata |
optional test set (predictor data). |
new_ydata |
optional test set (outcome data). |
stability |
output of |
family |
type of regression model. Possible values include
|
implementation |
optional function to refit the model. If
|
prediction |
optional function to compute predicted values from the
model refitted with |
resampling |
resampling approach to create the training set. The default
is |
K |
number of training-test splits. Only used if |
tau |
proportion of observations used in the training set. Only used if
|
seed |
value of the seed to ensure reproducibility of the results. Only
used if |
n_thr |
number of thresholds to use to construct the ROC curve. If
|
time |
numeric indicating the time for which the survival probabilities are computed. Only applicable to Cox regression. |
verbose |
logical indicating if a loading bar and messages should be printed. |
... |
additional parameters passed to the function provided in
|
Details
For a fair evaluation of the prediction performance, the data is
split into a training set (including a proportion tau of the
observations) and a test set (remaining observations). The regression model
is fitted on the training set and applied to the test set. Performance
metrics are computed in the test set by comparing predicted and observed
outcomes.
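The splitting procedure described above can be sketched in base R. This is only an illustration under stated assumptions, not the internal code of the function: it uses simple random subsampling without stratification, an ordinary linear model fitted with lm(), and simulated data. The object names (mydata, split_sizes, predictions) are hypothetical.

```r
# Simulated data (hypothetical): 3 predictors and a continuous outcome
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, paste0("var", 1:3)))
y <- as.vector(x %*% c(1, -1, 0.5) + rnorm(n))
mydata <- data.frame(y = y, x)

tau <- 0.8 # proportion of observations in the training set
K <- 5 # number of training/test splits

split_sizes <- matrix(NA, nrow = K, ncol = 2, dimnames = list(NULL, c("train", "test")))
predictions <- vector("list", K)
for (k in seq_len(K)) {
  # Random subsampling (assumption: no stratification)
  ids_train <- sample(n, size = floor(tau * n))
  ids_test <- setdiff(seq_len(n), ids_train)

  # Model fitted on the training set and applied to the test set
  fit <- lm(y ~ ., data = mydata[ids_train, ])
  predictions[[k]] <- data.frame(
    observed = mydata$y[ids_test],
    predicted = predict(fit, newdata = mydata[ids_test, ])
  )
  split_sizes[k, ] <- c(length(ids_train), length(ids_test))
}
split_sizes
```

Each element of predictions holds the observed and predicted outcomes for one test set, from which a performance metric can be computed.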
For logistic regression, a Receiver Operating Characteristic (ROC) analysis is performed: the True and False Positive Rates (TPR and FPR) are computed for different thresholds in predicted probabilities, and the Area Under the Curve (AUC) is derived from the resulting curve.
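As an illustration of the ROC analysis, the following base R sketch computes the TPR and FPR over a grid of thresholds and obtains the AUC with the trapezoidal rule. It assumes a single training/test split, a logistic model fitted with glm(), simulated data, and one threshold per distinct predicted probability. The function itself may choose thresholds and compute the AUC differently.

```r
# Simulated data (hypothetical): one predictor and a binary outcome
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(2 * x))
mydata <- data.frame(y = y, x = x)

# Single training/test split
ids_train <- sample(n, size = round(0.8 * n))
ids_test <- setdiff(seq_len(n), ids_train)

# Logistic model fitted on the training set, probabilities predicted in the test set
fit <- glm(y ~ x, data = mydata[ids_train, ], family = binomial)
prob <- predict(fit, newdata = mydata[ids_test, ], type = "response")
obs <- mydata$y[ids_test]

# Thresholds in decreasing order, so that the curve goes from (0, 0) to (1, 1)
thr <- c(Inf, sort(unique(prob), decreasing = TRUE), -Inf)
TPR <- sapply(thr, function(t) mean(prob[obs == 1] >= t))
FPR <- sapply(thr, function(t) mean(prob[obs == 0] >= t))

# Area under the curve by the trapezoidal rule
AUC <- sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2)
AUC
```

With one threshold per distinct predicted probability, the trapezoidal AUC coincides with the proportion of case/control pairs in which the case receives the higher predicted probability (ties counted as one half).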
For Cox regression, the Concordance Index (as implemented in
concordance) looking at survival probabilities up
to a specific time is computed.
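A rough sketch of a test-set concordance for Cox regression is given below, using coxph() and concordance() from the survival package. It is a simplification and makes several assumptions: the data are simulated, a single split is used, and the linear predictor is used as the risk score, whereas the function described here is documented as working with survival probabilities up to a given time.

```r
library(survival)

# Simulated data (hypothetical): larger x means higher hazard, i.e. shorter survival
set.seed(1)
n <- 300
x <- rnorm(n)
event_time <- rexp(n, rate = exp(x))
censor_time <- rexp(n, rate = 0.3)
mydata <- data.frame(
  time = pmin(event_time, censor_time),
  status = as.numeric(event_time <= censor_time),
  x = x
)

# Single training/test split
ids_train <- sample(n, size = round(0.8 * n))
ids_test <- setdiff(seq_len(n), ids_train)

# Cox model fitted on the training set
fit <- coxph(Surv(time, status) ~ x, data = mydata[ids_train, ])

# Risk score (linear predictor) in the test set
test <- mydata[ids_test, ]
test$risk <- predict(fit, newdata = test, type = "lp")

# reverse = TRUE because a larger risk score predicts a shorter survival time
cstat <- concordance(Surv(time, status) ~ risk, data = test, reverse = TRUE)$concordance
cstat
```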
For linear regression, the squared correlation between predicted and observed outcomes in the test set (Q-squared) is reported.
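The Q-squared can be illustrated with a few lines of base R. The sketch assumes simulated data, a single training/test split, and a model fitted with lm(); the object names are hypothetical.

```r
# Simulated data (hypothetical): one predictor and a continuous outcome
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)
mydata <- data.frame(y = y, x = x)

# Single training/test split
ids_train <- sample(n, size = round(0.8 * n))
ids_test <- setdiff(seq_len(n), ids_train)

# Linear model fitted on the training set and applied to the test set
fit <- lm(y ~ x, data = mydata[ids_train, ])
predicted <- predict(fit, newdata = mydata[ids_test, ])

# Q-squared: squared correlation between predicted and observed outcomes
Q_squared <- cor(predicted, mydata$y[ids_test])^2
Q_squared
```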
Value
A list with:
TPR |
True Positive Rate (for logistic regression only). |
FPR |
False Positive Rate (for logistic regression only). |
AUC |
Area Under the Curve (for logistic regression only). |
concordance |
Concordance index (for Cox regression only). |
Beta |
matrix of estimated beta coefficients across the |
See Also
Other prediction performance functions:
Incremental()
Examples
# Data simulation
set.seed(1)
simul <- SimulateRegression(
  n = 1000, pk = 20,
  family = "binomial", ev_xy = 0.8
)

# Data split: selection, training and test set
ids <- Split(
  data = simul$ydata,
  family = "binomial",
  tau = c(0.4, 0.3, 0.3)
)
xselect <- simul$xdata[ids[[1]], ]
yselect <- simul$ydata[ids[[1]], ]
xtrain <- simul$xdata[ids[[2]], ]
ytrain <- simul$ydata[ids[[2]], ]
xtest <- simul$xdata[ids[[3]], ]
ytest <- simul$ydata[ids[[3]], ]

# Stability selection
stab <- VariableSelection(
  xdata = xselect,
  ydata = yselect,
  family = "binomial"
)

# Performances in test set of model refitted in training set
roc <- ExplanatoryPerformance(
  xdata = xtrain, ydata = ytrain,
  new_xdata = xtest, new_ydata = ytest,
  stability = stab
)
plot(roc)
roc$AUC

# Alternative with multiple training/test splits
roc <- ExplanatoryPerformance(
  xdata = rbind(xtrain, xtest),
  ydata = c(ytrain, ytest),
  stability = stab, K = 100
)
plot(roc)
boxplot(roc$AUC)

# Partial Least Squares Discriminant Analysis
if (requireNamespace("sgPLS", quietly = TRUE)) {
  stab <- VariableSelection(
    xdata = xselect,
    ydata = yselect,
    implementation = SparsePLS,
    family = "binomial"
  )

  # Defining wrapping functions for predictions from PLS-DA
  PLSDA <- function(xdata, ydata, family = "binomial") {
    model <- mixOmics::plsda(X = xdata, Y = as.factor(ydata), ncomp = 1)
    return(model)
  }
  PredictPLSDA <- function(xdata, model) {
    xdata <- xdata[, rownames(model$loadings$X), drop = FALSE]
    predicted <- predict(object = model, newdata = xdata)$predict[, 2, 1]
    return(predicted)
  }

  # Performances with custom models
  roc <- ExplanatoryPerformance(
    xdata = rbind(xtrain, xtest),
    ydata = c(ytrain, ytest),
    stability = stab, K = 100,
    implementation = PLSDA, prediction = PredictPLSDA
  )
  plot(roc)
}