pred_validate {predRupdate}        R Documentation
Validate an existing prediction model
Description
Validate an existing prediction model by calculating its predictive performance against a new (validation) dataset.
Usage
pred_validate(
  x,
  new_data,
  binary_outcome = NULL,
  survival_time = NULL,
  event_indicator = NULL,
  time_horizon = NULL,
  cal_plot = TRUE,
  ...
)
Arguments
x
an object of class "predinfo", produced by calling pred_input_info.
new_data
data.frame upon which the prediction model should be evaluated.
binary_outcome
Character variable giving the name of the column in new_data that represents the observed binary outcomes. Only relevant when x$model_type = "logistic"; leave as NULL otherwise.
survival_time
Character variable giving the name of the column in new_data that represents the observed survival times. Only relevant when x$model_type = "survival"; leave as NULL otherwise.
event_indicator
Character variable giving the name of the column in new_data that represents the observed event indicator. Only relevant when x$model_type = "survival"; leave as NULL otherwise.
time_horizon
for survival models, an integer giving the time horizon (post baseline) at which a prediction is required. Currently, this must match a time in x$cum_hazard.
cal_plot
indicate if a flexible calibration plot should be produced (TRUE) or not (FALSE).
...
further plotting arguments for the calibration plot. See Details below.
Details
This function takes an existing prediction model formatted according to pred_input_info, and calculates measures of predictive performance on new data (e.g., within an external validation study). The information about the existing prediction model should first be inputted by calling pred_input_info, before passing the resulting object to pred_validate.
new_data should be a data.frame, where each row should be an observation (e.g. patient) and each variable/column should be a predictor variable. The predictor variables need to include (as a minimum) all of the predictor variables that are included in the existing prediction model (i.e., each of the variable names supplied to pred_input_info, through the model_info parameter, must match the name of a variable in new_data).
Any factor variables within new_data must be converted to dummy (0/1) variables before calling this function. dummy_vars can help with this. See pred_predict for examples.
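For instance, a minimal sketch of this preparation step (the data.frame and its columns are hypothetical, and it is assumed that dummy_vars accepts a data.frame as its first argument and returns it with factor columns expanded into 0/1 indicator variables):

# Hypothetical validation data containing a factor predictor
val_data <- data.frame(Age = c(55, 63, 47),
                       Smoker = factor(c("Yes", "No", "Yes")),
                       Y = c(1, 0, 1))

# Convert factor variables into 0/1 dummy variables before validation
# (assumes dummy_vars() takes the data.frame and returns the converted data.frame)
val_data_dummy <- dummy_vars(val_data)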
binary_outcome, survival_time and event_indicator are used to specify the outcome variable(s) within new_data (use binary_outcome if x$model_type = "logistic", or use survival_time and event_indicator if x$model_type = "survival").
In the case of validating a logistic regression model, this function assesses the predictive performance of the predicted risks against an observed binary outcome. Various metrics of calibration (agreement between the observed and predicted risks, across the full risk range) and discrimination (ability of the model to distinguish between those who develop the outcome and those who do not) are calculated. For calibration, the observed-to-expected ratio, calibration intercept and calibration slope are estimated. The calibration intercept is estimated by fitting a logistic regression model to the observed binary outcomes, with the linear predictor of the model as an offset. For the calibration slope, a logistic regression model is fit to the observed binary outcomes with the linear predictor from the model as the only covariate. For discrimination, the function estimates the area under the receiver operating characteristic curve (AUC). Various other metrics are also calculated to assess overall accuracy (Brier score, Cox-Snell R2).
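As an illustration of these calibration estimands (a minimal, self-contained sketch with made-up outcomes y and linear predictor values lp; it shows the quantities being estimated, not the package's internal code):

# Hypothetical observed binary outcomes and linear predictor values
y  <- c(0, 1, 0, 1, 1)
lp <- c(-1.2, 0.4, -0.8, 1.1, 0.3)

# Observed-to-expected ratio: observed events versus sum of predicted risks
oe_ratio <- sum(y) / sum(plogis(lp))

# Calibration intercept: logistic model with the linear predictor as an offset
cal_int <- glm(y ~ offset(lp), family = binomial)

# Calibration slope: logistic model with the linear predictor as the only covariate
cal_slope <- glm(y ~ lp, family = binomial)

coef(cal_int)          # calibration intercept
coef(cal_slope)["lp"]  # calibration slope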
In the case of validating a survival prediction model, this function assesses the predictive performance of the linear predictor and (optionally) the predicted event probabilities at a fixed time horizon against an observed time-to-event outcome. Various metrics of calibration and discrimination are calculated. For calibration, the observed-to-expected ratio at the specified time_horizon (if predicted risks are available through specification of x$cum_hazard) and calibration slope are produced. For discrimination, Harrell's C-statistic is calculated.
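As an illustration of these estimands (a minimal sketch using the survival package with made-up data; it mirrors the quantities described above rather than the package's internal code):

library(survival)

# Hypothetical follow-up times, event indicators and linear predictor values
time   <- c(2.3, 5.1, 1.7, 4.0, 3.2, 6.0)
status <- c(1, 0, 1, 1, 0, 1)
lp     <- c(0.8, -0.4, 1.1, -0.2, 0.6, 0.1)

# Calibration slope: Cox model with the linear predictor as the only covariate
coef(coxph(Surv(time, status) ~ lp))

# Harrell's C-statistic for the linear predictor
# (reverse = TRUE because larger lp implies higher risk, i.e. shorter survival)
concordance(Surv(time, status) ~ lp, reverse = TRUE)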
For both model types, a flexible calibration plot is produced (for survival models, the cumulative baseline hazard must be available in the predinfo object, x$cum_hazard). Specify parameter cal_plot to indicate whether a calibration plot should be produced (TRUE), or not (FALSE). The calibration plot is produced by regressing the observed outcomes against a cubic spline of the logit of predicted risks (for a logistic model) or the complementary log-log of the predicted risks (for a survival model). Users can specify parameters to modify the calibration plot. Specifically, one can specify xlab, ylab, xlim, and ylim to change plotting characteristics for the calibration plot. A rug can be added to the x-axis of the plot by setting pred_rug as TRUE; this can be used to show the predicted risk distribution by outcome status.
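For example, a sketch of customising the calibration plot through these arguments (using the model1 object and dataset created in the Examples section below; the axis labels and limits are illustrative):

## Not run:
pred_validate(x = model1,
              new_data = SYNPM$ValidationData,
              binary_outcome = "Y",
              cal_plot = TRUE,
              xlab = "Predicted risk",
              ylab = "Observed risk",
              xlim = c(0, 1),
              ylim = c(0, 1),
              pred_rug = TRUE)
## End(Not run)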
Value
pred_validate returns an object of class "predvalidate", with child classes per model_type. This is a list of performance metrics, estimated by applying the existing prediction model to the new_data. An object of class "predvalidate" is a list containing relevant calibration and discrimination measures. For logistic regression models, this will include observed:expected ratio, calibration intercept, calibration slope, area under the ROC curve, R-squared, and Brier Score. For survival models, this will include observed:expected ratio (if cum_hazard is provided to x), calibration slope, and Harrell's C-statistic. Optionally, a flexible calibration plot is also produced, along with a box-plot and violin plot of the predicted risk distribution.
The summary function can be used to extract and print summary performance results (calibration and discrimination metrics). The graphical assessments of performance can be extracted using plot.
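For example (assuming val_results is a "predvalidate" object such as the one created in the Examples below, with the relevant calibration information available for plotting):

## Not run:
summary(val_results)  # print calibration and discrimination metrics
plot(val_results)     # display the graphical performance assessments
## End(Not run)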
See Also
pred_input_info, pred_predict
Examples
#Example 1 - multiple existing models, with outcome specified; uses
# an example dataset within the package
model1 <- pred_input_info(model_type = "logistic",
                          model_info = SYNPM$Existing_logistic_models)
val_results <- pred_validate(x = model1,
                             new_data = SYNPM$ValidationData,
                             binary_outcome = "Y",
                             cal_plot = FALSE)
summary(val_results)