validate {cvms} | R Documentation |
Validate regression models on a test set
Description
Train linear or logistic regression models on a training set and validate them by
predicting a test/validation set.
Returns the results in a tibble for easy reporting, along with the trained models.
See validate_fn() for use with custom model functions.
Usage
validate(
  train_data,
  formulas,
  family,
  test_data = NULL,
  partitions_col = ".partitions",
  control = NULL,
  REML = FALSE,
  cutoff = 0.5,
  positive = 2,
  metrics = list(),
  preprocessing = NULL,
  err_nc = FALSE,
  rm_nc = FALSE,
  parallel = FALSE,
  verbose = FALSE,
  link = deprecated(),
  models = deprecated(),
  model_verbose = deprecated()
)
Arguments
train_data |
Data frame. Can contain a grouping factor for identifying partitions, as made with groupdata2::partition(). See partitions_col.
formulas |
Model formulas as strings. (Character) E.g. c("y~x", "y~z"). Can contain random effects in lme4 syntax, e.g. "y~x+(1|r)".
family |
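Name of the family. (Character) Currently supports "gaussian" (linear regression) and "binomial" (logistic regression). See validate_fn() for use with other model functions.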
test_data |
Data frame. If NULL (the default), the test set is identified via partitions_col.
partitions_col |
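Name of grouping factor for identifying partitions. (Character) Rows with the value 1 in partitions_col are used as the training set, and rows with the value 2 are used as the test set. N.B. Only used if test_data is NULL.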
control |
Construct control structures for mixed model fitting
(with lme4::lmer or lme4::glmer). See lme4::lmerControl and lme4::glmerControl.
REML |
Restricted Maximum Likelihood. (Logical)
cutoff |
Threshold for predicted classes. (Numeric) N.B. Binomial models only.
positive |
Level from dependent variable to predict.
Either as character (preferable) or level index (1 or 2, alphabetically).
E.g. if we have the levels "cat" and "dog" and want "dog" to be the positive class, we can provide either "dog" or 2, as "dog" comes after "cat" alphabetically.
Note: For reproducibility, it's preferable to specify the name directly, as different locales may sort the levels differently.
Used when calculating confusion matrix metrics and creating ROC curves.
N.B. Only affects evaluation metrics, not the model training or returned predictions.
N.B. Binomial models only.
metrics |
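List for enabling/disabling metrics. E.g. list("RMSE" = FALSE) would remove RMSE from the results. Default values are used for the remaining metrics.
You can enable/disable all metrics at once by including "all" = TRUE/FALSE in the list.
The list can be created with gaussian_metrics() or binomial_metrics().
Also accepts the string "all".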
preprocessing |
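Name of preprocessing to apply. Available preprocessings are: "standardize", "range", "scale", and "center".
The preprocessing parameters (mean, SD, etc.) are extracted from the training set and applied to both the training set and the test set.
N.B. The preprocessings should not affect the results to a noticeable degree, although "range" might, due to out-of-bounds values in the test set.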
err_nc |
Whether to raise an error if a model does not converge. (Logical)
rm_nc |
Remove non-converged models from output. (Logical)
parallel |
Whether to validate the list of models in parallel. (Logical) Remember to register a parallel backend first.
E.g. with doParallel::registerDoParallel.
verbose |
Whether to message process information like the number of model instances to fit and which model function was applied. (Logical)
link, models, model_verbose |
Deprecated. |
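The interplay between positive and cutoff can be sketched in base R. This is a minimal illustration, not the package's internal code: the class labels and probabilities are hypothetical, and it assumes that probabilities above the cutoff are assigned to the second (alphabetically last) level.

```r
# Hypothetical class labels; factor() sorts levels alphabetically by default
target <- factor(c("healthy", "sick", "sick", "healthy"))
levels(target)

# Level index 2 and the name "sick" refer to the same class here.
# The name is safer, since the sort order can depend on the locale.
positive_by_index <- levels(target)[2]

# Hypothetical predicted probabilities of the second level
probabilities <- c(0.2, 0.7, 0.4, 0.9)
cutoff <- 0.5

# Assumption: values above the cutoff are classified as the second level
predicted_class <- ifelse(
  probabilities > cutoff,
  levels(target)[2],
  levels(target)[1]
)
print(predicted_class)
```

Raising the cutoff makes the second level harder to predict, which typically trades sensitivity for specificity.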
Details
Packages used:
Models
Gaussian: stats::lm, lme4::lmer
Binomial: stats::glm, lme4::glmer
Results
Shared
AIC: stats::AIC
AICc: MuMIn::AICc
BIC: stats::BIC
Gaussian
r2m: MuMIn::r.squaredGLMM
r2c: MuMIn::r.squaredGLMM
Binomial
ROC and AUC: pROC::roc
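To make the Gaussian case without random effects concrete, the following base R sketch fits stats::lm on a training set, predicts a test set, and computes the information criteria listed above. It is an illustration under assumptions, not what validate() runs internally: the built-in mtcars data and the odd/even row split are stand-ins chosen so the example needs no extra packages.

```r
# Deterministic train/test split of a built-in dataset (illustrative only)
train <- mtcars[seq(1, nrow(mtcars), by = 2), ]  # odd rows
test <- mtcars[seq(2, nrow(mtcars), by = 2), ]   # even rows

# Gaussian model without random effects
fit <- stats::lm(mpg ~ wt, data = train)

# Predict the held-out rows and compute the prediction error
predictions <- stats::predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - predictions)^2))

# Information criteria of the fitted model
info <- c(AIC = stats::AIC(fit), BIC = stats::BIC(fit))

print(rmse)
print(info)
```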
Value
A tibble with the results and model objects.
Shared across families
A nested tibble with coefficients of the models from all iterations.
Count of convergence warnings. Consider discarding models that did not converge.
Count of other warnings. These are warnings without keywords such as "convergence".
Count of Singular Fit messages. See lme4::isSingular for more information.
Nested tibble with the warnings and messages caught for each model.
Specified family.
Nested model objects.
Name of dependent variable.
Names of fixed effects.
Names of random effects, if any.
Nested tibble with preprocessing parameters, if any.
----------------------------------------------------------------
Gaussian Results
----------------------------------------------------------------
RMSE, MAE, NRMSE(IQR), RRSE, RAE, RMSLE, AIC, AICc, and BIC.
See the additional metrics (disabled by default) at ?gaussian_metrics.
A nested tibble with the predictions and targets.
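Several of the error metrics above can be reproduced by hand from the nested predictions and targets. The sketch below uses the common textbook definitions of RMSE, MAE, RRSE, and RAE on hypothetical values; it assumes, rather than confirms, that the package uses the same formulas.

```r
# Hypothetical targets and predictions
targets <- c(1, 2, 3, 4)
predictions <- c(1.5, 2, 2.5, 5)

errors <- targets - predictions
deviations <- targets - mean(targets)

# Root mean square error and mean absolute error
rmse <- sqrt(mean(errors^2))
mae <- mean(abs(errors))

# Relative errors: model error relative to always predicting the mean target
rrse <- sqrt(sum(errors^2) / sum(deviations^2))
rae <- sum(abs(errors)) / sum(abs(deviations))

print(c(RMSE = rmse, MAE = mae, RRSE = rrse, RAE = rae))
```

RRSE and RAE values below 1 indicate that the model does better than a baseline that always predicts the mean of the targets.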
----------------------------------------------------------------
Binomial Results
----------------------------------------------------------------
Based on predictions of the test set,
a confusion matrix and ROC curve are used to get the following:
ROC: AUC, Lower CI, and Upper CI.
Confusion Matrix: Balanced Accuracy, F1, Sensitivity, Specificity,
Positive Predictive Value, Negative Predictive Value, Kappa,
Detection Rate, Detection Prevalence, Prevalence, and
MCC (Matthews correlation coefficient).
See the additional metrics (disabled by default) at ?binomial_metrics.
Also includes:
A nested tibble with predictions, predicted classes (depends on cutoff), and the targets.
Note that the predictions are not necessarily of the specified positive class, but of
the model's positive class (second level of dependent variable, alphabetically).
The pROC::roc ROC curve object(s).
A nested tibble with the confusion matrix/matrices.
The Pos_ columns tell you whether a row is a
True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN),
depending on which level is the "positive" class, i.e. the level you wish to predict.
The name of the Positive Class.
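How the confusion matrix counts depend on the positive class can be sketched in base R. The labels and predictions below are hypothetical, and the formulas are the standard definitions of the metrics, not code taken from the package.

```r
# Hypothetical targets and predicted classes
positive <- "sick"
targets <- c("sick", "sick", "sick", "healthy",
             "healthy", "healthy", "healthy", "healthy")
predicted <- c("sick", "sick", "healthy", "healthy",
               "healthy", "healthy", "sick", "sick")

# Confusion matrix counts relative to the positive class
tp <- sum(predicted == positive & targets == positive)
tn <- sum(predicted != positive & targets != positive)
fp <- sum(predicted == positive & targets != positive)
fn <- sum(predicted != positive & targets == positive)

# Standard definitions of a few of the listed metrics
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2
ppv <- tp / (tp + fp)
f1 <- 2 * ppv * sensitivity / (ppv + sensitivity)

print(c(TP = tp, TN = tn, FP = fp, FN = fn))
print(c(Sensitivity = sensitivity, Specificity = specificity,
        BalancedAccuracy = balanced_accuracy, F1 = f1))
```

Setting positive to "healthy" instead swaps the roles of TP and TN (and of FP and FN), so Sensitivity and Specificity trade places while Balanced Accuracy stays the same.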
Author(s)
Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk
See Also
Other validation functions:
cross_validate()
,
cross_validate_fn()
,
validate_fn()
Examples
# Attach packages
library(cvms)
library(groupdata2) # partition()
library(dplyr) # %>% arrange()
# Data is part of cvms
data <- participant.scores
# Set seed for reproducibility
set.seed(7)
# Partition data
# Keep as single data frame
# We could also have fed validate() separate train and test sets.
data_partitioned <- partition(
  data,
  p = 0.7,
  cat_col = "diagnosis",
  id_col = "participant",
  list_out = FALSE
) %>%
  arrange(.partitions)
# Validate a model
# Gaussian
validate(
  data_partitioned,
  formulas = "score~diagnosis",
  partitions_col = ".partitions",
  family = "gaussian",
  REML = FALSE
)
# Binomial
validate(
  data_partitioned,
  formulas = "diagnosis~score",
  partitions_col = ".partitions",
  family = "binomial"
)
## Feed separate train and test sets
# Partition data to list of data frames
# The first data frame will be train (70% of the data)
# The second will be test (30% of the data)
data_partitioned <- partition(
  data,
  p = 0.7,
  cat_col = "diagnosis",
  id_col = "participant",
  list_out = TRUE
)
train_data <- data_partitioned[[1]]
test_data <- data_partitioned[[2]]
# Validate a model
# Gaussian
validate(
  train_data,
  test_data = test_data,
  formulas = "score~diagnosis",
  family = "gaussian",
  REML = FALSE
)
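As a further sketch, several formulas can be validated in one call, and the metrics argument can restrict the output. The random-effects formula with the session column and the metrics list with "all" are assumptions based on the argument descriptions rather than examples from this page, and the code only runs if cvms, groupdata2, and lme4 are installed.

```r
# Only run when the required packages are available
if (requireNamespace("cvms", quietly = TRUE) &&
    requireNamespace("groupdata2", quietly = TRUE) &&
    requireNamespace("lme4", quietly = TRUE)) {
  set.seed(7)

  # Single data frame with a .partitions column
  data_partitioned <- groupdata2::partition(
    cvms::participant.scores,
    p = 0.7,
    cat_col = "diagnosis",
    id_col = "participant",
    list_out = FALSE
  )

  # Two models: one with fixed effects only, one with a random intercept.
  # Disable all metrics, then re-enable RMSE.
  results <- cvms::validate(
    data_partitioned,
    formulas = c("score~diagnosis", "score~diagnosis+(1|session)"),
    family = "gaussian",
    metrics = list("all" = FALSE, "RMSE" = TRUE)
  )
  print(results)
}
```

The returned tibble is expected to hold one row per formula, which makes it straightforward to compare the models side by side.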