Train linear or logistic regression models on a training set and validate them by
predicting a test/validation set.
Returns the results in a `tibble` for easy reporting, along with the trained models.

See `validate_fn()` for use with custom model functions.

```
validate(
  train_data,
  formulas,
  family,
  test_data = NULL,
  partitions_col = ".partitions",
  control = NULL,
  REML = FALSE,
  cutoff = 0.5,
  positive = 2,
  metrics = list(),
  preprocessing = NULL,
  err_nc = FALSE,
  rm_nc = FALSE,
  parallel = FALSE,
  verbose = FALSE,
  link = deprecated(),
  models = deprecated(),
  model_verbose = deprecated()
)
```

| Argument | Description |
| --- | --- |
| `train_data` | Can contain a grouping factor for identifying partitions - as made with |
| `formulas` | Model formulas as strings. (Character) E.g. Can contain random effects. E.g. |
| `family` | Name of the family. (Character) Currently supports `lm()` / `lme4::lmer()` and for binary classification with `"binomial"` `glm()` / `lme4::glmer()`. See |
| `test_data` | |
| `partitions_col` | Name of grouping factor for identifying partitions. (Character) Rows with the value N.B. |
| `control` | Construct control structures for mixed model fitting (with |
| `REML` | Restricted Maximum Likelihood. (Logical) |
| `cutoff` | Threshold for predicted classes. (Numeric) N.B. |
| `positive` | Level from dependent variable to predict. Either as character ( E.g. if we have the levels Used when calculating confusion matrix metrics and creating The N.B. Only affects evaluation metrics, not the model training or returned predictions. N.B. |
| `metrics` | E.g. You can enable/disable all metrics at once by including The Also accepts the string |
| `preprocessing` | Name of preprocessing to apply. Available preprocessings are: The preprocessing parameters ( N.B. The preprocessings should not affect the results to a noticeable degree, although |
| `err_nc` | Whether to raise an |
| `rm_nc` | Remove non-converged models from output. (Logical) |
| `parallel` | Whether to validate the list of models in parallel. (Logical) Remember to register a parallel backend first. E.g. with |
| `verbose` | Whether to message process information like the number of model instances to fit and which model function was applied. (Logical) |
| `link, models, model_verbose` | Deprecated. |

Packages used:

Gaussian: `stats::lm`, `lme4::lmer`

Binomial: `stats::glm`, `lme4::glmer`

`AIC`: `stats::AIC`

`AICc`: `MuMIn::AICc`

`BIC`: `stats::BIC`

`r2m`: `MuMIn::r.squaredGLMM`

`r2c`: `MuMIn::r.squaredGLMM`

`ROC and AUC`: `pROC::roc`
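As a rough illustration of the base-R part of the list above, the following sketch fits one Gaussian and one binomial model with `stats::lm` and `stats::glm` and reads off `AIC` and `BIC`. The toy data frame and its column names (`x`, `y`, `z`) are made up for this example, and the mixed-model and `MuMIn` / `pROC` functions are left out because they need additional packages.

```r
# Hypothetical toy data; column names are illustrative only
set.seed(1)
d <- data.frame(x = rnorm(30))
d$y <- 1 + 0.5 * d$x + rnorm(30)
d$z <- factor(ifelse(d$y > median(d$y), "high", "low"))

# Gaussian model without random effects
gauss_fit <- stats::lm(y ~ x, data = d)

# Binomial model without random effects
binom_fit <- stats::glm(z ~ x, data = d, family = "binomial")

# Information criteria as computed by the stats package
gauss_ic <- c(AIC = stats::AIC(gauss_fit), BIC = stats::BIC(gauss_fit))
binom_ic <- c(AIC = stats::AIC(binom_fit), BIC = stats::BIC(binom_fit))

gauss_ic
binom_ic
```

Formulas with random effects would go through `lme4::lmer` or `lme4::glmer` instead, as the list indicates.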

`tibble` with the results and model objects.

A nested `tibble` with **coefficients** of the models from all iterations.

Count of **convergence warnings**. Consider discarding models that did not converge.

Count of **other warnings**. These are warnings without keywords such as "convergence".

Count of **Singular Fit messages**. See `lme4::isSingular` for more information.

Nested `tibble` with the **warnings and messages** caught for each model.

Specified **family**.

Nested **model** objects.

Name of **dependent** variable.

Names of **fixed** effects.

Names of **random** effects, if any.

Nested `tibble` with **preprocess**ing parameters, if any.

**RMSE**, `MAE`, `NRMSE(IQR)`, `RRSE`, `RAE`, `RMSLE`, `AIC`, `AICc`, `BIC`

See the additional metrics (disabled by default) at `?gaussian_metrics`.

A nested `tibble` with the **predictions** and targets.
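To make the error metrics above more concrete, here is a minimal base-R sketch that fits a model on a training set, predicts a test set, and computes a few of them by hand. The data, the 70/30 split, and the variable names are hypothetical, and the formulas are the common textbook definitions; the exact definitions used by the package are documented in `?gaussian_metrics` and may differ in detail.

```r
# Hypothetical toy data; `x`, `y` and the split are illustrative only
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20)

# Random 70/30 split into training and test rows
train_idx <- sample(nrow(d), 14)
train <- d[train_idx, ]
test <- d[-train_idx, ]

# Fit on the training set, then predict the test set
fit <- stats::lm(y ~ x, data = train)
pred <- stats::predict(fit, newdata = test)
err <- test$y - pred

# Textbook definitions (assumed here, not taken from the package source)
rmse <- sqrt(mean(err^2))
mae <- mean(abs(err))
nrmse_iqr <- rmse / stats::IQR(test$y)
rrse <- sqrt(sum(err^2) / sum((test$y - mean(test$y))^2))
rae <- sum(abs(err)) / sum(abs(test$y - mean(test$y)))

round(c(RMSE = rmse, MAE = mae, `NRMSE(IQR)` = nrmse_iqr, RRSE = rrse, RAE = rae), 3)
```

All of these are computed on the test set only, which is why they measure how well the model generalizes rather than how well it fits the training data.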

Based on predictions of the test set,
a confusion matrix and `ROC` curve are used to get the following:

`ROC`: **AUC**, `Lower CI`, `Upper CI`

`Confusion Matrix`: **Balanced Accuracy**, `F1`, `Sensitivity`, `Specificity`, `Positive Predictive Value`, `Negative Predictive Value`, `Kappa`, `Detection Rate`, `Detection Prevalence`, `Prevalence`, `MCC`

See the additional metrics (disabled by default) at `?binomial_metrics`.
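Several of the confusion matrix metrics above follow directly from the four cell counts. The sketch below computes a handful of them in base R using the standard textbook formulas; the targets, predicted classes, and the choice of `"b"` as the positive class are hypothetical, and the package's own implementation may handle edge cases differently.

```r
# Hypothetical targets and predicted classes; "b" is treated as the positive class
targets <- factor(
  c("a", "a", "a", "a", "b", "b", "b", "b", "b", "a", "a"),
  levels = c("a", "b")
)
predictions <- factor(
  c("a", "a", "b", "a", "b", "b", "a", "b", "b", "a", "b"),
  levels = c("a", "b")
)
positive <- "b"

# Cell counts of the 2x2 confusion matrix
tp <- sum(predictions == positive & targets == positive)
tn <- sum(predictions != positive & targets != positive)
fp <- sum(predictions == positive & targets != positive)
fn <- sum(predictions != positive & targets == positive)

# Standard definitions
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2
ppv <- tp / (tp + fp)
npv <- tn / (tn + fn)
f1 <- 2 * ppv * sensitivity / (ppv + sensitivity)
mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

round(c(
  `Balanced Accuracy` = balanced_accuracy, F1 = f1,
  Sensitivity = sensitivity, Specificity = specificity,
  `Positive Predictive Value` = ppv, `Negative Predictive Value` = npv,
  MCC = mcc
), 3)
```

Swapping `positive` to `"a"` exchanges sensitivity with specificity and changes `F1`, while balanced accuracy and `MCC` stay the same.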

Also includes:

A nested `tibble` with **predictions**, predicted classes (depends on `cutoff`), and the targets.
Note that the predictions are *not necessarily* of the *specified* `positive` class, but of
the *model's* positive class (second level of dependent variable, alphabetically).

The `pROC::roc` **ROC** curve object(s).

A nested `tibble` with the **confusion matrix**/matrices.
The `Pos_` columns tell you whether a row is a
True Positive (`TP`), True Negative (`TN`),
False Positive (`FP`), or False Negative (`FN`),
depending on which level is the "positive" class, i.e. the level you wish to predict.

The name of the **Positive Class**.
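The relationship between the predicted probabilities, the `cutoff`, and the `Pos_` labels can be sketched in base R as follows. The data, the helper `label_rows()`, and the column names `Pos_a` / `Pos_b` are hypothetical and only mimic the idea; whether the package compares with `>` or `>=` at the cutoff is an assumption here.

```r
# Hypothetical data; "a" sorts before "b", so "b" is the second factor level
d <- data.frame(
  score = c(1, 2, 3, 4, 5, 6, 7, 8),
  diagnosis = factor(c("a", "a", "b", "a", "b", "a", "b", "b"))
)
fit <- stats::glm(diagnosis ~ score, data = d, family = "binomial")

# glm() models the probability of the second factor level ("b"),
# regardless of which level we later call "positive"
prob_second <- stats::predict(fit, newdata = d, type = "response")

# Turn probabilities into predicted classes (assumed: strictly above cutoff)
cutoff <- 0.5
lv <- levels(d$diagnosis)
pred_class <- factor(ifelse(prob_second > cutoff, lv[2], lv[1]), levels = lv)

# Hypothetical helper: label each row relative to a chosen positive class
label_rows <- function(pred, target, positive) {
  ifelse(
    pred == positive,
    ifelse(target == positive, "TP", "FP"),
    ifelse(target == positive, "FN", "TN")
  )
}

labelled <- data.frame(
  target = d$diagnosis,
  prediction = pred_class,
  Pos_a = label_rows(pred_class, d$diagnosis, "a"),
  Pos_b = label_rows(pred_class, d$diagnosis, "b")
)
labelled
```

The probabilities and predicted classes are identical in both columns; only the labels flip, so a `TP` with `"b"` as the positive class is a `TN` with `"a"` as the positive class.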

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

Other validation functions:
`cross_validate_fn()`, `cross_validate()`, `validate_fn()`

```
# Attach packages
library(cvms)
library(groupdata2) # partition()
library(dplyr) # %>% arrange()

# Data is part of cvms
data <- participant.scores

# Set seed for reproducibility
set.seed(7)

# Partition data
# Keep as single data frame
# We could also have fed validate() separate train and test sets.
data_partitioned <- partition(
  data,
  p = 0.7,
  cat_col = "diagnosis",
  id_col = "participant",
  list_out = FALSE
) %>%
  arrange(.partitions)

# Validate a model

# Gaussian
validate(
  data_partitioned,
  formulas = "score~diagnosis",
  partitions_col = ".partitions",
  family = "gaussian",
  REML = FALSE
)

# Binomial
validate(
  data_partitioned,
  formulas = "diagnosis~score",
  partitions_col = ".partitions",
  family = "binomial"
)

## Feed separate train and test sets

# Partition data to list of data frames
# The first data frame will be train (70% of the data)
# The second will be test (30% of the data)
data_partitioned <- partition(
  data,
  p = 0.7,
  cat_col = "diagnosis",
  id_col = "participant",
  list_out = TRUE
)
train_data <- data_partitioned[[1]]
test_data <- data_partitioned[[2]]

# Validate a model

# Gaussian
validate(
  train_data,
  test_data = test_data,
  formulas = "score~diagnosis",
  family = "gaussian",
  REML = FALSE
)
```

