R: Incremental prediction performance in regression

Incremental {sharp}

R Documentation

Incremental prediction performance in regression

Description

Computes the prediction performance of regression models in which predictors are added sequentially, in order of decreasing selection proportion. This function can be used to evaluate the marginal contribution of each selected predictor over and above the more stable predictors. Performance is evaluated as in `ExplanatoryPerformance`.
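
The sequential procedure described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the selection proportions in `selprop` are made up (in practice they would come from `VariableSelection`), and Q-squared is taken here as the squared correlation between predicted and observed outcomes in the test set, which is an assumption about the metric.

```r
set.seed(1)

# Toy data: only var1 and var2 contribute to the outcome
n <- 200
p <- 5
x <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("var", seq_len(p))))
y <- x[, "var1"] + 0.5 * x[, "var2"] + rnorm(n)

# Hypothetical selection proportions (assumed, not estimated here)
selprop <- c(var1 = 0.98, var2 = 0.91, var3 = 0.35, var4 = 0.20, var5 = 0.12)
ordered <- names(sort(selprop, decreasing = TRUE))

# Single training/test split with 80% of observations for training
train <- sample(n, size = round(0.8 * n))
test <- setdiff(seq_len(n), train)

# Fit nested models, adding one predictor at a time, and score each on the test set
q2 <- setNames(numeric(length(ordered)), ordered)
for (k in seq_along(ordered)) {
  vars <- ordered[seq_len(k)]
  dat_train <- data.frame(y = y[train], x[train, vars, drop = FALSE])
  dat_test <- data.frame(x[test, vars, drop = FALSE])
  fit <- lm(y ~ ., data = dat_train)
  q2[k] <- cor(predict(fit, newdata = dat_test), y[test])^2
}
q2
```

Each entry of `q2` is the test-set performance of the model containing that predictor and all predictors ranked above it, so the difference between consecutive entries gives a rough idea of the marginal contribution.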

Usage

Incremental(
  xdata,
  ydata,
  new_xdata = NULL,
  new_ydata = NULL,
  stability = NULL,
  family = NULL,
  implementation = NULL,
  prediction = NULL,
  resampling = "subsampling",
  n_predictors = NULL,
  K = 100,
  tau = 0.8,
  seed = 1,
  n_thr = NULL,
  time = 1000,
  verbose = TRUE,
  ...
)

Arguments

`xdata`	matrix of predictors with observations as rows and variables as columns.
`ydata`	optional vector or matrix of outcome(s). If `family` is set to `"binomial"` or `"multinomial"`, `ydata` can be a vector with character/numeric values or a factor.
`new_xdata`	optional test set (predictor data).
`new_ydata`	optional test set (outcome data).
`stability`	output of `VariableSelection`. If `stability=NULL` (the default), a model including all variables in `xdata` as predictors is fitted. Argument `family` must be provided in this case.
`family`	type of regression model. Possible values include `"gaussian"` (linear regression), `"binomial"` (logistic regression), and `"cox"` (survival analysis). If provided, this argument must be consistent with input `stability`.
`implementation`	optional function to refit the model. If `implementation=NULL` and `stability` is the output of `VariableSelection`, `lm` (linear regression), `coxph` (Cox regression), `glm` (logistic regression), or `multinom` (multinomial regression) is used.
`prediction`	optional function to compute predicted values from the model refitted with `implementation`.
`resampling`	resampling approach to create the training set. The default is `"subsampling"` for sampling without replacement of a proportion `tau` of the observations. Alternatively, this argument can be a function to use for resampling. This function must use arguments named `data` and `tau` and return the IDs of observations to be included in the resampled dataset.
`n_predictors`	number of predictors to consider.
`K`	number of training-test splits. Only used if `new_xdata` and `new_ydata` are not provided.
`tau`	proportion of observations used in the training set. Only used if `new_xdata` and `new_ydata` are not provided.
`seed`	value of the seed to ensure reproducibility of the results. Only used if `new_xdata` and `new_ydata` are not provided.
`n_thr`	number of thresholds to use to construct the ROC curve. If `n_thr=NULL`, all predicted probability values are iteratively used as thresholds. For faster computations on large data, fewer thresholds can be used. Only applicable to logistic regression.
`time`	numeric indicating the time for which the survival probabilities are computed. Only applicable to Cox regression.
`verbose`	logical indicating whether a loading bar and messages should be printed.
`...`	additional parameters passed to the function provided in `resampling`.
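
As an illustration of the `resampling` argument, a custom function could look like the sketch below. The function name `StratifiedSubsampling` and its extra argument `Z` are hypothetical; only the argument names `data` and `tau` and the returned vector of observation IDs are taken from the description above.

```r
# Hypothetical resampler: subsampling without replacement within each level of Z,
# so that group proportions are preserved in the training set.
# `data` and `tau` are the required argument names; `Z` is an assumed extra argument.
StratifiedSubsampling <- function(data, tau, Z, ...) {
  stopifnot(length(Z) == NROW(data))
  ids <- integer(0)
  for (z in unique(Z)) {
    members <- which(Z == z)
    n_keep <- round(tau * length(members))
    # Index into `members` to avoid the sample() pitfall with length-one vectors
    ids <- c(ids, members[sample.int(length(members), size = n_keep)])
  }
  sort(ids)
}

# Stand-alone check on toy data
set.seed(1)
Z <- rep(c("a", "b"), times = c(60, 40))
ids <- StratifiedSubsampling(data = rnorm(100), tau = 0.8, Z = Z)
table(Z[ids])
```

Under these assumptions, the function would be supplied as `resampling = StratifiedSubsampling`, with `Z` passed through `...`.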

Value

An object of class `incremental`.

For logistic regression, a list with:

`FPR`	A list with, for each of the models (sequentially added predictors), the False Positive Rates for different thresholds (columns) and different data splits (rows).
`TPR`	A list with, for each of the models (sequentially added predictors), the True Positive Rates for different thresholds (columns) and different data splits (rows).
`AUC`	A list with, for each of the models (sequentially added predictors), a vector of Area Under the Curve (AUC) values obtained with different data splits.
`Beta`	Estimated regression coefficients from visited models.
`names`	Names of the predictors by order of inclusion.
`stable`	Binary vector indicating which predictors are stably selected. Only returned if `stability` is provided.
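
To make the roles of `FPR`, `TPR`, `AUC` and `n_thr` concrete, the following base R sketch computes ROC points for one set of predicted probabilities. It is not the package's code: the helper `roc_points` is hypothetical, the evenly spaced threshold grid used when `n_thr` is set is an assumption, and the AUC is approximated with the trapezoidal rule.

```r
# Hypothetical helper: ROC points and trapezoidal AUC for binary outcomes
roc_points <- function(prob, truth, n_thr = NULL) {
  thr <- if (is.null(n_thr)) {
    # Use every predicted probability (plus 0 and 1 to close the curve)
    sort(unique(c(0, prob, 1)), decreasing = TRUE)
  } else {
    # Assumed coarse grid of n_thr thresholds for faster computation
    seq(1, 0, length.out = n_thr)
  }
  fpr <- vapply(thr, function(t) mean(prob[truth == 0] >= t), numeric(1))
  tpr <- vapply(thr, function(t) mean(prob[truth == 1] >= t), numeric(1))
  auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
  list(FPR = fpr, TPR = tpr, AUC = auc)
}

# Simulated outcomes and predicted probabilities
set.seed(1)
n <- 500
truth <- rbinom(n, size = 1, prob = 0.4)
prob <- plogis(-0.5 + 1.5 * truth + rnorm(n))

roc_full <- roc_points(prob, truth)              # all predicted values as thresholds
roc_coarse <- roc_points(prob, truth, n_thr = 20) # fewer thresholds
c(full = roc_full$AUC, coarse = roc_coarse$AUC)
```

A coarser grid gives shorter `FPR` and `TPR` vectors at the cost of a less precise curve.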

For Cox regression, a list with:

`concordance`	A list with, for each of the models (sequentially added predictors), a vector of concordance indices obtained with different data splits.
`Beta`	Estimated regression coefficients from visited models.
`names`	Names of the predictors by order of inclusion.
`stable`	Binary vector indicating which predictors are stably selected. Only returned if `stability` is provided.

For linear regression, a list with:

`Q_squared`	A list with, for each of the models (sequentially added predictors), a vector of Q-squared values obtained with different data splits.
`Beta`	Estimated regression coefficients from visited models.
`names`	Names of the predictors by order of inclusion.
`stable`	Binary vector indicating which predictors are stably selected. Only returned if `stability` is provided.
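
The list structure described above lends itself to a short per-model summary. The sketch below uses a mock list that only mimics the documented elements for linear regression; the values are simulated rather than real output, and it assumes that `names` and `stable` have one entry per visited model.

```r
# Mock object mimicking the documented structure (simulated values, not real output)
set.seed(1)
mock <- list(
  Q_squared = list(
    rnorm(10, mean = 0.20, sd = 0.02),
    rnorm(10, mean = 0.35, sd = 0.02),
    rnorm(10, mean = 0.36, sd = 0.02)
  ),
  names = c("var3", "var1", "var7"),
  stable = c(1, 1, 0)
)

# Median Q-squared across data splits for each sequentially extended model
perf <- data.frame(
  predictor = mock$names,
  stable = mock$stable,
  median_q2 = vapply(mock$Q_squared, median, numeric(1))
)

# Change in median performance when each predictor is added
perf$gain <- c(NA, diff(perf$median_q2))
perf
```

The same pattern would apply to the `AUC` or `concordance` elements returned for logistic or Cox regression.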

Examples


# Load the package providing the functions used below
library(sharp)

# Data simulation
set.seed(1)
simul <- SimulateRegression(
  n = 1000, pk = 20,
  family = "binomial", ev_xy = 0.8
)

# Data split: selection, training and test set
ids <- Split(
  data = simul$ydata,
  family = "binomial",
  tau = c(0.4, 0.3, 0.3)
)
xselect <- simul$xdata[ids[[1]], ]
yselect <- simul$ydata[ids[[1]], ]
xtrain <- simul$xdata[ids[[2]], ]
ytrain <- simul$ydata[ids[[2]], ]
xtest <- simul$xdata[ids[[3]], ]
ytest <- simul$ydata[ids[[3]], ]

# Stability selection
stab <- VariableSelection(
  xdata = xselect,
  ydata = yselect,
  family = "binomial"
)

# Performance in the test set of models refitted in the training set
incr <- Incremental(
  xdata = xtrain, ydata = ytrain,
  new_xdata = xtest, new_ydata = ytest,
  stability = stab, n_predictors = 10
)
plot(incr)

# Alternative with multiple training/test splits
incr <- Incremental(
  xdata = rbind(xtrain, xtest),
  ydata = c(ytrain, ytest),
  stability = stab, K = 10, n_predictors = 10
)
plot(incr)