cnorm.cv {cNORM} | R Documentation |
Cross-validation for Term Selection in cNORM
Description
Assists in determining the optimal number of terms for the regression model using repeated Monte Carlo cross-validation. The data are split 80-20 into training and validation sets, stratified by norm group. When sliding window ranking is used, the split is drawn as a random sample instead.
Usage
cnorm.cv(
data,
formula = NULL,
repetitions = 5,
norms = TRUE,
min = 1,
max = 12,
cv = "full",
pCutoff = NULL,
width = NA,
raw = NULL,
group = NULL,
age = NULL,
weights = NULL
)
Arguments
data |
Data frame of the norm sample or a cnorm object. The data should include the ranking, the powers, and the interactions of L and A. |
formula |
Formula from an existing regression model. If provided, the min and max arguments are ignored. If using a cnorm object, the formula is fetched automatically. |
repetitions |
Number of repetitions for cross-validation. |
norms |
If TRUE, computes norm score crossfit and R^2. Note: Computationally intensive. |
min |
Start with a minimum number of terms (default = 1). |
max |
Maximum number of terms in the model, up to (k + 1) * (t + 1) - 1, where k and t are the power degrees of location and age. |
cv |
"full" (default) splits data into training/validation, then ranks. Otherwise, expects a pre-ranked dataset. |
pCutoff |
Cutoff for checking the stratification of unbalanced data. A t-test is performed per group. A value of 0.2 is used by default to minimize beta error. |
width |
If provided, ranking is done via 'rankBySlidingWindow'; otherwise, ranking is done by group. |
raw |
Name of the raw score variable. |
group |
Name of the grouping variable. |
age |
Name of the age variable. |
weights |
Name of the weighting parameter. |
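The upper bound given for 'max' can be checked with a short calculation. The values of k and t below are illustrative assumptions, not defaults taken from this page, and the variable names are hypothetical.

```r
# Illustrative values only: assumed power degrees for location (L) and age (A)
k_deg <- 4
t_deg <- 4

# Upper bound for the number of terms, following (k + 1) * (t + 1) - 1
max_terms <- (k_deg + 1) * (t_deg + 1) - 1
max_terms  # 24
```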
Details
Successive models with an increasing number of terms are evaluated, and the RMSE for raw scores is plotted for the training data, the validation data, and the entire dataset. If 'norms' is set to TRUE (default), the function also calculates the mean norm score reliability and crossfit measures. Because norm score calculations are computationally demanding, execution can be slow, especially with many repetitions or terms.
When 'cv' is set to "full" (default), the training and validation datasets are ranked separately, providing comprehensive cross-validation. For a more streamlined validation process focused only on modeling, a pre-ranked dataset can be used. The output comprises RMSE for raw score models, norm score R^2, delta R^2, crossfit, and the norm score SE according to Oosterhuis, van der Ark, & Sijtsma (2016).
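The repeated 80-20 split into training and validation data can be illustrated with a minimal base R sketch. It does not reproduce the internals of cnorm.cv: the simulated data, the plain polynomial model, and the helper name mc_cv_rmse are assumptions made for illustration only, and no ranking or stratification is performed.

```r
# Illustrative sketch only: simulated data and a plain polynomial model,
# not the ranking and regression procedure used by cNORM.
set.seed(1)
n <- 500
d <- data.frame(age = runif(n, 6, 12))
d$raw <- 10 + 3 * d$age + rnorm(n, sd = 4)

# Hypothetical helper: repeated Monte Carlo cross-validation with an 80-20 split
mc_cv_rmse <- function(data, degree, repetitions = 5) {
  rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
  res <- replicate(repetitions, {
    # Draw a fresh random 80% training sample in every repetition
    idx <- sample(nrow(data), size = floor(0.8 * nrow(data)))
    train <- data[idx, ]
    valid <- data[-idx, ]
    fit <- lm(raw ~ poly(age, degree), data = train)
    c(train = rmse(train$raw, predict(fit, train)),
      validation = rmse(valid$raw, predict(fit, valid)))
  })
  # Average the RMSE values over all repetitions
  rowMeans(res)
}

# Mean training and validation RMSE for models of increasing complexity
cv <- sapply(1:4, function(k) mc_cv_rmse(d, degree = k, repetitions = 3))
colnames(cv) <- paste0("degree", 1:4)
print(round(cv, 2))
```

Comparing the two rows across columns shows the pattern the function plots: training RMSE keeps falling as complexity grows, while validation RMSE levels off or rises once a model starts to overfit.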
For assessing overfitting:
CROSSFIT = R(Training; Model)^2 / R(Validation; Model)^2
A CROSSFIT > 1 suggests overfitting, < 1 suggests potential underfitting, and values around 1 are optimal, given a low raw score RMSE and high norm score validation R^2.
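As a small worked example of the ratio above, assuming two hypothetical R^2 values for a single model size (the numbers are invented and not taken from any cNORM output):

```r
# Hypothetical R^2 values for one model size (illustrative only)
r2_training <- 0.992
r2_validation <- 0.985

# CROSSFIT as the ratio of training R^2 to validation R^2
crossfit <- r2_training / r2_validation

crossfit > 1    # TRUE: the model fits the training data slightly better
crossfit > 1.1  # FALSE: still well below a marked overfit
```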
Suggestions for ideal model selection:
Visual inspection of percentiles with 'plotPercentiles' or 'plotPercentileSeries'.
Pair visual inspection with repeated cross-validation (e.g., 10 repetitions).
Aim for low raw score RMSE and high norm score R^2, avoiding terms with significant overfit (e.g., crossfit > 1.1).
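The numeric part of these suggestions could be applied to a results table roughly as follows. The data frame, its column names, and its values are hypothetical and do not reflect the actual structure returned by cnorm.cv. Visual inspection of the percentiles remains necessary in any case.

```r
# Hypothetical results table; column names and values are illustrative only
cv_table <- data.frame(
  terms = 1:6,
  rmse_validation = c(4.9, 4.1, 3.6, 3.5, 3.5, 3.6),
  r2_validation = c(0.950, 0.975, 0.988, 0.990, 0.990, 0.989),
  crossfit = c(0.99, 1.00, 1.01, 1.02, 1.08, 1.14)
)

# Drop model sizes with marked overfit (crossfit > 1.1)
candidates <- cv_table[cv_table$crossfit <= 1.1, ]

# Rank by validation R^2 (descending); prefer fewer terms in case of ties
candidates <- candidates[order(-candidates$r2_validation, candidates$terms), ]
head(candidates, 3)
```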
Value
Table with results per term number: RMSE for raw scores, R^2 for norm scores, and crossfit measure.
References
Oosterhuis, H. E. M., van der Ark, L. A., & Sijtsma, K. (2016). Sample Size Requirements for Traditional and Regression-Based Norms. Assessment, 23(2), 191–202. https://doi.org/10.1177/1073191115580638
See Also
Other model: bestModel(), checkConsistency(), derive(), modelSummary(), print.cnorm(), printSubset(), rangeCheck(), regressionFunction(), summary.cnorm()
Examples
## Not run:
# Example: Plot cross-validation RMSE by number of terms (up to 9) with three repetitions.
result <- cnorm(raw = elfe$raw, group = elfe$group)
cnorm.cv(result$data, min = 2, max = 9, repetitions = 3)
# Using a cnorm object examines the predefined formula.
cnorm.cv(result, repetitions = 1)
# For cross-validation without a cnorm model, rank data first and compute powers:
data <- rankByGroup(data = elfe, raw = "raw", group = "group")
data <- computePowers(data)
cnorm.cv(data)
# Specify a formula explicitly:
data <- rankByGroup(data = elfe, raw = "raw", group = "group")
data <- computePowers(data)
cnorm.cv(data, formula = formula(raw ~ L3 + L1A1 + L3A3 + L4 + L5))
## End(Not run)