ExplanatoryPerformance {sharp}    R Documentation

Prediction performance in regression

Description

Calculates model performance for linear (measured by Q-squared), logistic (AUC) or Cox (C-statistic) regression. This is done by (i) refitting the model on a training set including a proportion tau of the observations, and (ii) evaluating the performance on the remaining observations (test set). For more reliable results, the procedure can be repeated K times (default K=1).

Usage

ExplanatoryPerformance(
  xdata,
  ydata,
  new_xdata = NULL,
  new_ydata = NULL,
  stability = NULL,
  family = NULL,
  implementation = NULL,
  prediction = NULL,
  resampling = "subsampling",
  K = 1,
  tau = 0.8,
  seed = 1,
  n_thr = NULL,
  time = 1000,
  verbose = FALSE,
  ...
)

Arguments

xdata

matrix of predictors with observations as rows and variables as columns.

ydata

vector or matrix of outcome(s). If family is set to "binomial" or "multinomial", ydata can be a vector with character/numeric values or a factor.

new_xdata

optional test set (predictor data).

new_ydata

optional test set (outcome data).

stability

output of VariableSelection. If stability=NULL (the default), a model including all variables in xdata as predictors is fitted. Argument family must be provided in this case.

family

type of regression model. Possible values include "gaussian" (linear regression), "binomial" (logistic regression), and "cox" (survival analysis). If provided, this argument must be consistent with input stability.

implementation

optional function to refit the model. If implementation=NULL and stability is the output of VariableSelection, lm (linear regression), coxph (Cox regression), glm (logistic regression), or multinom (multinomial regression) is used.

prediction

optional function to compute predicted values from the model refitted with implementation.

resampling

resampling approach to create the training set. The default is "subsampling" for sampling without replacement of a proportion tau of the observations. Alternatively, this argument can be a function to use for resampling. This function must use arguments named data and tau and return the IDs of observations to be included in the resampled dataset.

K

number of training-test splits. Only used if new_xdata and new_ydata are not provided.

tau

proportion of observations used in the training set. Only used if new_xdata and new_ydata are not provided.

seed

value of the seed to ensure reproducibility of the results. Only used if new_xdata and new_ydata are not provided.

n_thr

number of thresholds to use to construct the ROC curve. If n_thr=NULL, all predicted probability values are iteratively used as thresholds. For faster computations on large data, fewer thresholds can be used. Only applicable to logistic regression.

time

numeric indicating the time for which the survival probabilities are computed. Only applicable to Cox regression.

verbose

logical indicating if a loading bar and messages should be printed.

...

additional parameters passed to the function provided in resampling.
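
As an illustration of the resampling argument described above, the sketch below defines a custom resampling function. The function name BalancedResampling and the simulated outcome are hypothetical. The only constraints taken from this page are that the function uses arguments named data and tau and returns the IDs of the observations to include. That data holds a binary outcome is an assumption made for this sketch.

```r
# Hypothetical custom resampling function (name and logic are illustrative).
# Documented contract: arguments named `data` and `tau`, returning the IDs
# (row indices) of the observations to include in the resampled dataset.
# Assumption: `data` is a binary (0/1) outcome vector or one-column matrix.
BalancedResampling <- function(data, tau, ...) {
  y <- as.vector(data)
  ids0 <- which(y == 0)
  ids1 <- which(y == 1)
  # Sample a proportion tau within each class, without replacement
  s0 <- ids0[sample.int(length(ids0), size = round(tau * length(ids0)))]
  s1 <- ids1[sample.int(length(ids1), size = round(tau * length(ids1)))]
  sort(c(s0, s1))
}

# Small check on simulated data
set.seed(1)
y <- rbinom(100, size = 1, prob = 0.3)
ids <- BalancedResampling(data = y, tau = 0.8)
```

A function of this form could then be supplied as resampling = BalancedResampling, with any extra arguments passed through the ... argument.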

Details

For a fair evaluation of the prediction performance, the data is split into a training set (including a proportion tau of the observations) and a test set (the remaining observations). The regression model is fitted on the training set and applied to the test set. Performance metrics are computed in the test set by comparing predicted and observed outcomes.
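
The split-refit-predict procedure described above can be sketched in base R as follows. This is a minimal illustration on simulated data for a single split, not the code used inside ExplanatoryPerformance. The object names (dat, ids_train, fit, predicted) are chosen for this example only.

```r
# Base R sketch of one training/test split (not the sharp implementation)
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 3), nrow = n, dimnames = list(NULL, paste0("var", 1:3)))
y <- as.vector(x %*% c(1, -1, 0.5) + rnorm(n))
dat <- data.frame(y = y, x)

# Subsampling: a proportion tau of the observations, without replacement
tau <- 0.8
ids_train <- sample.int(n, size = round(tau * n))
train <- dat[ids_train, ]
test <- dat[-ids_train, ]

# (i) Refit the model on the training set
fit <- lm(y ~ ., data = train)

# (ii) Apply the fitted model to the held-out test set
predicted <- predict(fit, newdata = test)
```

Repeating these steps with different random splits corresponds to using K > 1.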

For logistic regression, a Receiver Operating Characteristic (ROC) analysis is performed: the True and False Positive Rates (TPR and FPR) are computed for different thresholds in predicted probabilities, and the Area Under the Curve (AUC) is derived from the resulting ROC curve.
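
To make the ROC analysis concrete, the sketch below computes TPR, FPR and AUC by hand in base R from simulated predicted probabilities. It uses every distinct predicted value as a threshold and the trapezoidal rule for the area. These are assumptions made for illustration and may differ from the internal implementation.

```r
# Simulated test-set outcomes and predicted probabilities (illustrative only)
set.seed(1)
observed <- rbinom(100, size = 1, prob = 0.5)
predicted <- plogis(2 * observed - 1 + rnorm(100))

# Thresholds in decreasing order, padded so the curve runs from (0, 0) to (1, 1)
thr <- sort(unique(c(-Inf, predicted, Inf)), decreasing = TRUE)

# Rates of cases and controls classified as positive at each threshold
TPR <- vapply(thr, function(th) mean(predicted[observed == 1] >= th), numeric(1))
FPR <- vapply(thr, function(th) mean(predicted[observed == 0] >= th), numeric(1))

# Area under the ROC curve by the trapezoidal rule
auc <- sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2)

# Cross-check against the rank-based (Mann-Whitney) formulation of the AUC
n1 <- sum(observed == 1)
n0 <- sum(observed == 0)
auc_rank <- (sum(rank(predicted)[observed == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
```

Using a coarser grid of thresholds, as allowed by n_thr, reduces the number of points on the curve at the cost of a less precise area estimate.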

For Cox regression, the Concordance Index (as implemented in concordance) is computed using survival probabilities up to a specific time (argument time).
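
As a rough illustration of what a concordance index measures, the sketch below computes a simplified Harrell-type C-index by hand in base R. It is not the concordance function referenced above: it ranks subjects by a hypothetical risk score rather than by survival probabilities at a given time, and it ignores tied event times. All variable names and the simulated data are assumptions for this example.

```r
# Simulated survival data (illustrative only)
set.seed(1)
n <- 60
risk <- rnorm(n)                           # hypothetical predicted risk (higher = worse)
time <- rexp(n, rate = exp(risk))          # follow-up times, shorter for higher risk
status <- rbinom(n, size = 1, prob = 0.8)  # 1 = event observed, 0 = censored

concordant <- 0
comparable <- 0
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    # A pair is comparable if subject i had an observed event before time[j]
    if (status[i] == 1 && time[i] < time[j]) {
      comparable <- comparable + 1
      if (risk[i] > risk[j]) {
        # Concordant: the subject failing first has the higher predicted risk
        concordant <- concordant + 1
      } else if (risk[i] == risk[j]) {
        # Tied predictions count for one half
        concordant <- concordant + 0.5
      }
    }
  }
}
c_index <- concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1 to perfect ranking of the comparable pairs.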

For linear regression, the squared correlation between predicted and observed outcomes in the test set (Q-squared) is reported.
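
The Q-squared definition above amounts to a one-line computation. The sketch below uses simulated vectors standing in for test-set outcomes and predictions. The names observed, predicted and q_squared are chosen for this example only.

```r
# Simulated observed outcomes and predictions in a test set (illustrative only)
set.seed(1)
observed <- rnorm(50)
predicted <- observed + rnorm(50, sd = 0.5)

# Q-squared: squared correlation between predicted and observed outcomes
q_squared <- cor(predicted, observed)^2

# The squared correlation is unchanged by shifting or rescaling the predictions
q_squared_rescaled <- cor(2 * predicted + 1, observed)^2
```

Because it is a squared correlation, this metric reflects how well predictions track the observed outcomes linearly, and is not sensitive to a constant bias or scaling in the predictions.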

Value

A list with:

TPR

True Positive Rate (for logistic regression only).

FPR

False Positive Rate (for logistic regression only).

AUC

Area Under the Curve (for logistic regression only).

concordance

Concordance index (for Cox regression only).

Beta

matrix of estimated beta coefficients across the K iterations. Coefficients are extracted using the coef function.

See Also

VariableSelection, Refit

Other prediction performance functions: Incremental()

Examples


# Data simulation
set.seed(1)
simul <- SimulateRegression(
  n = 1000, pk = 20,
  family = "binomial", ev_xy = 0.8
)

# Data split: selection, training and test set
ids <- Split(
  data = simul$ydata,
  family = "binomial",
  tau = c(0.4, 0.3, 0.3)
)
xselect <- simul$xdata[ids[[1]], ]
yselect <- simul$ydata[ids[[1]], ]
xtrain <- simul$xdata[ids[[2]], ]
ytrain <- simul$ydata[ids[[2]], ]
xtest <- simul$xdata[ids[[3]], ]
ytest <- simul$ydata[ids[[3]], ]

# Stability selection
stab <- VariableSelection(
  xdata = xselect,
  ydata = yselect,
  family = "binomial"
)

# Performance in the test set of the model refitted on the training set
roc <- ExplanatoryPerformance(
  xdata = xtrain, ydata = ytrain,
  new_xdata = xtest, new_ydata = ytest,
  stability = stab
)
plot(roc)
roc$AUC

# Alternative with multiple training/test splits
roc <- ExplanatoryPerformance(
  xdata = rbind(xtrain, xtest),
  ydata = c(ytrain, ytest),
  stability = stab, K = 100
)
plot(roc)
boxplot(roc$AUC)

# Partial Least Squares Discriminant Analysis
if (requireNamespace("sgPLS", quietly = TRUE) && requireNamespace("mixOmics", quietly = TRUE)) {
  stab <- VariableSelection(
    xdata = xselect,
    ydata = yselect,
    implementation = SparsePLS,
    family = "binomial"
  )

  # Defining wrapping functions for predictions from PLS-DA
  PLSDA <- function(xdata, ydata, family = "binomial") {
    model <- mixOmics::plsda(X = xdata, Y = as.factor(ydata), ncomp = 1)
    return(model)
  }
  PredictPLSDA <- function(xdata, model) {
    xdata <- xdata[, rownames(model$loadings$X), drop = FALSE]
    predicted <- predict(object = model, newdata = xdata)$predict[, 2, 1]
    return(predicted)
  }

  # Performance with custom refitting and prediction functions
  roc <- ExplanatoryPerformance(
    xdata = rbind(xtrain, xtest),
    ydata = c(ytrain, ytest),
    stability = stab, K = 100,
    implementation = PLSDA, prediction = PredictPLSDA
  )
  plot(roc)
}


[Package sharp version 1.4.6 Index]