Incremental {sharp}    R Documentation

Incremental prediction performance in regression

Description

Computes the prediction performance of regression models in which predictors are added sequentially, in order of decreasing selection proportion. This function can be used to evaluate the marginal contribution of each selected predictor over and above the more stable predictors. Performance is evaluated as in ExplanatoryPerformance.
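
The incremental procedure can be illustrated with a minimal base R sketch. This is not the sharp implementation: the selection proportions in selprop are hypothetical values standing in for the output of a stability selection run, a single training/test split is used, and the squared correlation between predicted and observed outcomes is used here as an illustrative performance metric.

```r
# Minimal sketch of incremental model building (base R only, not sharp's code)
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 4), ncol = 4, dimnames = list(NULL, paste0("var", 1:4)))
y <- 1.5 * x[, "var1"] + 0.5 * x[, "var3"] + rnorm(n)

# Hypothetical selection proportions (assumed, would come from stability selection)
selprop <- c(var1 = 0.98, var2 = 0.10, var3 = 0.75, var4 = 0.05)

# Order predictors by decreasing selection proportion
ordered_vars <- names(sort(selprop, decreasing = TRUE))

# Single training/test split (80% of observations for training)
train <- sample(n, size = 0.8 * n)
test <- setdiff(seq_len(n), train)
dat <- data.frame(y = y, x)

# Fit nested models, adding one predictor at a time, and score each on the test set
q2 <- sapply(seq_along(ordered_vars), function(k) {
  form <- reformulate(ordered_vars[seq_len(k)], response = "y")
  fit <- lm(form, data = dat[train, ])
  pred <- predict(fit, newdata = dat[test, ])
  cor(pred, dat$y[test])^2 # illustrative metric (assumption)
})
names(q2) <- ordered_vars
print(round(q2, 3))
```

Each entry of q2 is the test-set performance of the model containing that predictor and all predictors ranked above it, so the difference between consecutive entries reflects the marginal contribution of the newly added predictor.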

Usage

Incremental(
  xdata,
  ydata,
  new_xdata = NULL,
  new_ydata = NULL,
  stability = NULL,
  family = NULL,
  implementation = NULL,
  prediction = NULL,
  resampling = "subsampling",
  n_predictors = NULL,
  K = 100,
  tau = 0.8,
  seed = 1,
  n_thr = NULL,
  time = 1000,
  verbose = TRUE,
  ...
)

Arguments

xdata

matrix of predictors with observations as rows and variables as columns.

ydata

optional vector or matrix of outcome(s). If family is set to "binomial" or "multinomial", ydata can be a vector with character/numeric values or a factor.

new_xdata

optional test set (predictor data).

new_ydata

optional test set (outcome data).

stability

output of VariableSelection. If stability=NULL (the default), a model including all variables in xdata as predictors is fitted. Argument family must be provided in this case.

family

type of regression model. Possible values include "gaussian" (linear regression), "binomial" (logistic regression), and "cox" (survival analysis). If provided, this argument must be consistent with input stability.

implementation

optional function to refit the model. If implementation=NULL and stability is the output of VariableSelection, lm (linear regression), coxph (Cox regression), glm (logistic regression), or multinom (multinomial regression) is used.

prediction

optional function to compute predicted values from the model refitted with implementation.

resampling

resampling approach to create the training set. The default is "subsampling" for sampling without replacement of a proportion tau of the observations. Alternatively, this argument can be a function to use for resampling. This function must use arguments named data and tau and return the IDs of observations to be included in the resampled dataset.

n_predictors

number of predictors to consider.

K

number of training-test splits. Only used if new_xdata and new_ydata are not provided.

tau

proportion of observations used in the training set. Only used if new_xdata and new_ydata are not provided.

seed

value of the seed to ensure reproducibility of the results. Only used if new_xdata and new_ydata are not provided.

n_thr

number of thresholds to use to construct the ROC curve. If n_thr=NULL, all predicted probability values are iteratively used as thresholds. For faster computations on large data, fewer thresholds can be used. Only applicable to logistic regression.

time

numeric indicating the time for which the survival probabilities are computed. Only applicable to Cox regression.

verbose

logical indicating if a loading bar and messages should be printed.

...

additional parameters passed to the function provided in resampling.
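
As described for argument resampling, a user-defined function must use arguments named data and tau and return the IDs of the observations to include. The following is a minimal sketch of such a function. The name StratifiedSubsample is hypothetical, and the sketch assumes that data is a binary outcome vector (or one-column matrix); it draws a proportion tau of the observations within each outcome class.

```r
# Hypothetical custom resampling function (not part of sharp).
# Assumes `data` holds a binary outcome; returns row indices of the subsample.
StratifiedSubsample <- function(data, tau, ...) {
  y <- as.vector(data)
  # Sample a proportion tau of the observation IDs within each outcome class
  ids <- unlist(lapply(split(seq_along(y), y), function(idx) {
    idx[sample.int(length(idx), size = round(tau * length(idx)))]
  }))
  sort(unname(ids))
}

# Stand-alone check on simulated outcome data
set.seed(1)
y <- rbinom(100, size = 1, prob = 0.3)
ids <- StratifiedSubsample(data = y, tau = 0.8)
print(length(ids))
print(table(y[ids]))
```

A function of this form could then be supplied as resampling = StratifiedSubsample, with any extra arguments forwarded through the ... argument.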

Value

An object of class incremental.

For logistic regression, a list with:

FPR

A list with, for each of the models (sequentially added predictors), the False Positive Rates for different thresholds (columns) and different data splits (rows).

TPR

A list with, for each of the models (sequentially added predictors), the True Positive Rates for different thresholds (columns) and different data splits (rows).

AUC

A list with, for each of the models (sequentially added predictors), a vector of Area Under the Curve (AUC) values obtained with different data splits.

Beta

Estimated regression coefficients from visited models.

names

Names of the predictors by order of inclusion.

stable

Binary vector indicating which predictors are stably selected. Only returned if stability is provided.

For Cox regression, a list with:

concordance

A list with, for each of the models (sequentially added predictors), a vector of concordance indices obtained with different data splits.

Beta

Estimated regression coefficients from visited models.

names

Names of the predictors by order of inclusion.

stable

Binary vector indicating which predictors are stably selected. Only returned if stability is provided.

For linear regression, a list with:

Q_squared

A list with, for each of the models (sequentially added predictors), a vector of Q-squared obtained with different data splits.

Beta

Estimated regression coefficients from visited models.

names

Names of the predictors by order of inclusion.

stable

Binary vector indicating which predictors are stably selected. Only returned if stability is provided.
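
To make the FPR, TPR and AUC outputs for logistic regression more concrete, the sketch below computes a ROC curve over a fixed number of thresholds and integrates it with the trapezoidal rule. It uses base R only and simulated predicted probabilities. The use of equally spaced thresholds on [0, 1] is an assumption made for illustration and is not necessarily how sharp chooses thresholds when n_thr is set.

```r
# Minimal ROC/AUC sketch with a fixed number of thresholds (base R only)
set.seed(1)
n <- 500
y <- rbinom(n, size = 1, prob = 0.4)
# Simulated predicted probabilities, higher on average for cases (y == 1)
p <- plogis(rnorm(n, mean = ifelse(y == 1, 1, -1)))

# Equally spaced thresholds from 1 down to 0 (illustrative assumption)
n_thr <- 50
thr <- seq(1, 0, length.out = n_thr)

# True and false positive rates at each threshold
tpr <- sapply(thr, function(t) mean(p[y == 1] >= t))
fpr <- sapply(thr, function(t) mean(p[y == 0] >= t))

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(round(auc, 3))
```

Using fewer thresholds gives a coarser curve and a cheaper computation, which is the trade-off controlled by n_thr.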

See Also

VariableSelection, Refit

Other prediction performance functions: ExplanatoryPerformance()

Examples


# Data simulation
set.seed(1)
simul <- SimulateRegression(
  n = 1000, pk = 20,
  family = "binomial", ev_xy = 0.8
)

# Data split: selection, training and test set
ids <- Split(
  data = simul$ydata,
  family = "binomial",
  tau = c(0.4, 0.3, 0.3)
)
xselect <- simul$xdata[ids[[1]], ]
yselect <- simul$ydata[ids[[1]], ]
xtrain <- simul$xdata[ids[[2]], ]
ytrain <- simul$ydata[ids[[2]], ]
xtest <- simul$xdata[ids[[3]], ]
ytest <- simul$ydata[ids[[3]], ]

# Stability selection
stab <- VariableSelection(
  xdata = xselect,
  ydata = yselect,
  family = "binomial"
)

# Performance on the test set of models refitted on the training set
incr <- Incremental(
  xdata = xtrain, ydata = ytrain,
  new_xdata = xtest, new_ydata = ytest,
  stability = stab, n_predictors = 10
)
plot(incr)

# Alternative with multiple training/test splits
incr <- Incremental(
  xdata = rbind(xtrain, xtest),
  ydata = c(ytrain, ytest),
  stability = stab, K = 10, n_predictors = 10
)
plot(incr)



[Package sharp version 1.4.6 Index]