Stage.1 {NormData}        R Documentation

Stage 1 of the regression-based normative analysis

Description

The function Stage.1 fits a regression model with the specified mean and residual variance components, and conducts several model checks (homoscedasticity, normality, absence of outliers, and multicollinearity) that are useful in a setting where regression-based normative data have to be established.

Usage

Stage.1(Dataset, Model, Order.Poly.Var=3, 
   Alpha=0.05, Alpha.Homosc=0.05, Alpha.Norm=0.05, 
   Assume.Homoscedasticity=NULL,
   Test.Assumptions=TRUE, Outlier.Cut.Off=4, 
   Show.VIF=TRUE, GVIF.Threshold=10, Sandwich.Type="HC0", 
   Alpha.CI.Group.Spec.SD.Resid=0.01)

Arguments

Dataset

A data.frame that should consist of one line per test participant (the so-called ‘wide’ data-format). Each line should contain (at least) one test score and one independent variable.

Model

The regression model to be fitted (mean structure). A formula should be provided using the syntax of the lm function (for help, see ?lm). For example, Test.Score~Gender will fit a linear regression model in which Test.Score (the dependent variable) is regressed on Gender (the independent variable). Test.Score~Gender+Age+Gender:Age will regress Test.Score on Gender, Age, and the Gender by Age interaction. Test.Score~1 will fit an intercept-only model.
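
For instance, a minimal sketch (using a hypothetical data frame Demo.Data with columns Test.Score, Gender and Age; these names are illustrative only):

# intercept-only model
Fit.0 <- Stage.1(Dataset=Demo.Data, Model=Test.Score~1)

# main effects of Gender and Age plus their interaction
Fit.1 <- Stage.1(Dataset=Demo.Data, 
   Model=Test.Score~Gender+Age+Gender:Age)
summary(Fit.1)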

Order.Poly.Var

If the homoscedasticity assumption is violated and the mean structure of the fitted model contains at least one quantitative variable, a polynomial variance prediction function is fitted. The argument Order.Poly.Var= determines the order of the polynomial, e.g., Order.Poly.Var=1, Order.Poly.Var=2, Order.Poly.Var=3 for linear, quadratic and cubic polynomials, respectively. Default Order.Poly.Var=3.

Alpha

The significance level to be used when conducting inference for the mean structure of the model. Default Alpha=0.05.

Alpha.Homosc

The significance level to be used to evaluate the homoscedasticity assumption based on the Levene test (when all independent variables in the model are qualitative) or the Breusch-Pagan test (when at least one of the independent variables is quantitative). Default Alpha.Homosc=0.05.

Alpha.Norm

The significance level to be used to test the normality assumption for the standardized residuals using the Shapiro-Wilk test. The normality assumption is evaluated based on the standardized residuals in the normative dataset, which are computed as explained in the Assume.Homoscedasticity= argument documentation below. Default Alpha.Norm=0.05.

Assume.Homoscedasticity

Logical. The NormData package ‘decides’ whether the homoscedasticity assumption is valid based on the Levene test (for models that only include qualitative independent variables) or the Breusch-Pagan test (for models that include at least one quantitative independent variable). The Assume.Homoscedasticity=TRUE/FALSE argument can be used to overrule this decision process and ‘force’ the NormData package to assume or not assume homoscedasticity. When the argument Assume.Homoscedasticity=TRUE is used, the argument Alpha.Homosc=0 is automatically used in the Stage.1() function call and thus the homoscedasticity assumption will never be rejected (because the p-value of the Levene or Breusch-Pagan test statistic will always be larger than the specified \alpha=0). When Assume.Homoscedasticity=FALSE is used, the argument Alpha.Homosc=1 is automatically used and thus the homoscedasticity assumption will always be rejected (because the p-value of the Levene or Breusch-Pagan test statistic will always be smaller than the specified \alpha=1).

By default, the standardized residuals \widehat{\delta}_i that are shown in the normality and outlier output sections of the results (and in the plots, see plot.Stage.1) are computed based on the overall residual standard error when the homoscedasticity assumption is valid (i.e., as \widehat{\delta}_i = \frac{\widehat{\varepsilon}_i}{\widehat{\sigma}_{\varepsilon}}, with \widehat{\sigma}_{\varepsilon} corresponding to the overall residual standard error), or based on prediction-specific residual standard errors when the homoscedasticity assumption is invalid (i.e., as \widehat{\delta}_i = \frac{\widehat{\varepsilon}_i}{\widehat{\sigma}_{\varepsilon_i}}, with \widehat{\sigma}_{\varepsilon_i} obtained from a variance prediction function, e.g., the cubic polynomial \widehat{\sigma}^2_{\varepsilon_i} = \widehat{\gamma}_0 + \widehat{\gamma}_1 \: \widehat{Y} + \widehat{\gamma}_2 \: \widehat{Y}^2 + \widehat{\gamma}_3 \: \widehat{Y}^3 when the mean structure of the model contains quantitative independent variables).
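
For example, a minimal sketch of overruling this decision process (using the GCSE data and model from the Examples section below; illustrative only):

library(NormData)   # load the NormData package
data(GCSE)          # load the GCSE dataset

# force the analysis to not assume homoscedasticity, so that
# prediction-specific residual standard errors and sandwich
# standard errors are used
Fit.NoHomo <- Stage.1(Dataset=GCSE, Model=Science.Exam~Gender,
   Assume.Homoscedasticity=FALSE)
summary(Fit.NoHomo)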

Test.Assumptions

Logical. Should the model assumptions be evaluated for the specified model? Default Test.Assumptions=TRUE.

Outlier.Cut.Off

Outliers are evaluated based on the standardized residuals, which are computed as explained in the Assume.Homoscedasticity= argument documentation. The Outlier.Cut.Off= argument specifies the absolute value that is used as a threshold to detect outliers. Default Outlier.Cut.Off=4, so test scores with standardized residuals < -4 or > 4 are flagged as outliers.
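
For example, a minimal sketch of a more stringent outlier screening (using the GCSE data and model from the Examples section below; the cut-off value is illustrative only):

library(NormData)   # load the NormData package
data(GCSE)          # load the GCSE dataset

# flag test scores with standardized residuals < -3 or > 3 as outliers
Fit.Out <- Stage.1(Dataset=GCSE, Model=Science.Exam~Gender,
   Outlier.Cut.Off=3)
summary(Fit.Out)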

Show.VIF

Logical. Should the generalized VIF (Fox and Monette, 1992) be shown when the function summary() is applied to the fitted object? Default Show.VIF=TRUE. If all names of the independent variables in the fitted Stage 1 model contain the string ‘Age’ (e.g., Age, Age.2 and Age.3), a higher-order polynomial model for the mean structure is being fitted. For such models, multicollinearity diagnostics are essentially irrelevant (see Van der Elst, 2024) and in such cases the generalized VIF is not printed in the summary output. The generalized VIF is also not shown when there is only one independent variable in the model (because multicollinearity relates to the linear association of two or more independent variables).

GVIF.Threshold

The threshold value to be used to detect multicollinearity based on the generalized VIF. Default GVIF.Threshold=10.

Sandwich.Type

When the homoscedasticity assumption is violated, so-called sandwich estimators (or heteroscedasticity-consistent estimators) for the standard errors of the regression parameters are used. For example, the sandwich estimator for the standard error of \widehat{\beta}_1 in a simple linear regression model corresponds to \widehat{\sigma}_{{\beta}_1}=\sqrt{\frac{\sum\limits_{i=1}^{N}\left(\left(X_i - \widehat{\mu}_{X}\right)^2 \: \widehat{\varepsilon}_i^2 \right)}{\left(\sum\limits_{i=1}^{N}\left(X_i - \widehat{\mu}_{X}\right)^2\right)^2}}. For multiple linear regression models, the sandwich estimators \widehat{\sigma}_{{\beta}_0}, \widehat{\sigma}_{{\beta}_1}, ... correspond to the square roots of the diagonal elements of \boldsymbol{\widehat{\Sigma}}_{\beta} = \left(\boldsymbol{X}^{'}\boldsymbol{X}\right)^{-1} \left(\boldsymbol{X}^{'} \left[\begin{array}{cccc} \widehat{\varepsilon}^2_1 & 0 & \ldots & 0\\ 0 & \widehat{\varepsilon}^2_2 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & \widehat{\varepsilon}^2_N \end{array}\right] \boldsymbol{X}\right) \left(\boldsymbol{X}^{'}\boldsymbol{X}\right)^{-1}.

The sandwich estimators shown in the above expressions are referred to as the Heteroscedasticity-Consistent 0 (or HC0) estimator, which is the first sandwich estimator that was proposed in the literature. The HC0 sandwich estimator is justified based on asymptotic theory, and its application thus requires large sample sizes. For smaller sample sizes of N < 250, the use of the HC3 estimator is recommended because the HC0 sandwich estimator tends to be negatively biased (Long and Ervin, 2000).

By default, the HC0 estimator is used. The argument Sandwich.Type= can be used to request another type of heteroscedasticity-consistent estimator. For details on these estimators, see the vcovHC function of the sandwich package. If N < 250 and the homoscedasticity assumption is violated, a note will be given that the use of the HC3 estimator is recommended. To this end, the argument Sandwich.Type="HC3" can be added to the Stage.1() function call.
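
For example, a minimal sketch of requesting the HC3 estimator (using the GCSE data and model from the Examples section below; illustrative only):

library(NormData)   # load the NormData package
data(GCSE)          # load the GCSE dataset

# use the HC3 sandwich estimator for the standard errors of the
# regression parameters when the homoscedasticity assumption is violated
Fit.HC3 <- Stage.1(Dataset=GCSE, Model=Science.Exam~Gender,
   Sandwich.Type="HC3")
summary(Fit.HC3)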

Alpha.CI.Group.Spec.SD.Resid

The \alpha-level to be used for the CIs around the prediction-specific residual standard errors (which are used when the homoscedasticity assumption is invalid and the model only contains qualitative independent variables). These CIs are used in the variance function plot. Default Alpha.CI.Group.Spec.SD.Resid=0.01.

Details

For details, see Van der Elst (2024).

Value

An object of class Stage.1 with components,

HomoNorm

The fitted regression model that assumes both homoscedasticity and normality.

NoHomoNorm

The fitted regression model that does not assume homoscedasticity but assumes normality.

HomoNoNorm

The fitted regression model that assumes homoscedasticity but does not assume normality.

NoHomoNoNorm

The fitted regression model that assumes neither homoscedasticity nor normality.

Predicted

The predicted test scores based on the fitted model.

Sandwich.Type

The requested sandwich estimator.

Order.Poly.Var

The order of the polynomial variance prediction function.
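
These components can be extracted from the fitted object in the usual way; a minimal sketch (using the GCSE example from the Examples section below):

library(NormData)   # load the NormData package
data(GCSE)          # load the GCSE dataset

Fit <- Stage.1(Dataset=GCSE, Model=Science.Exam~Gender)
head(Fit$Predicted)    # predicted test scores
Fit$Sandwich.Type      # requested sandwich estimator
Fit$Order.Poly.Var     # order of the polynomial variance prediction function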

Author(s)

Wim Van der Elst

References

Fox, J. and Monette, G. (1992). Generalized collinearity diagnostics. Journal of the American Statistical Association, 87, 178-183.

Long, J. S. and Ervin, L. H. (2000). Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model. The American Statistician, 54, 217-224.

Van der Elst, W. (2024). Regression-based normative data for psychological assessment: A hands-on approach using R. Springer Nature.

See Also

plot.Stage.1, Stage.2.AutoScore, Stage.2.NormScore, Stage.2.NormTable

Examples

# Replicate the Stage 1 results that were obtained in 
# Case study 1 of Chapter 4 in Van der Elst (2024)
# ---------------------------------------------------
library(NormData)   # load the NormData package
data(GCSE)          # load the GCSE dataset

# Conduct the Stage 1 analysis
Model.1.GCSE <- Stage.1(Dataset=GCSE, 
    Model=Science.Exam~Gender)

summary(Model.1.GCSE)
plot(Model.1.GCSE)


# Replicate the Stage 1 results that were obtained in 
# Case study 1 of Chapter 7 in Van der Elst (2024)
# ---------------------------------------------------
library(NormData)   # load the NormData package
data(Substitution)  # load the Substitution dataset

# Add the variable Age.C (= Age centered) and its 
# quadratic and cubic terms to the Substitution dataset
Substitution$Age.C <- Substitution$Age - 50
Substitution$Age.C2 <- (Substitution$Age - 50)**2
Substitution$Age.C3 <- (Substitution$Age - 50)**3

# Fit the full Stage 1 model
Substitution.Model.1 <- Stage.1(Dataset=Substitution,
   Model=LDST~Age.C+Age.C2+Age.C3+Gender+LE+Age.C:LE+
   Gender:LE+Age.C:Gender, Alpha=0.005)
summary(Substitution.Model.1)

# Fit the model in which the non-significant Age.C:Gender
# interaction term is removed
Substitution.Model.2 <- Stage.1(Dataset=Substitution, 
    Alpha=0.005,
    Model=LDST~Age.C+Age.C2+Age.C3+Gender+LE+
    Age.C:LE+Gender:LE)
summary(Substitution.Model.2)

# Evaluate the significance of the Gender:LE interaction term
# GLT is used because the interaction involves multiple regression
# parameters
GLT.1 <- GLT(Dataset=Substitution, Alpha=0.005, 
   Unrestricted.Model=LDST~Age.C+Age.C2+Age.C3+
      Gender+LE+Age.C:LE+Gender:LE, 
   Restricted.Model=LDST~Age.C+Age.C2+Age.C3+
      Gender+LE+Age.C:LE)
summary(GLT.1)

# Fit the model in which the non-significant Gender:LE
# interaction term is removed
Substitution.Model.3 <- Stage.1(Dataset=Substitution, 
    Alpha=0.005,
    Model=LDST~Age.C+Age.C2+Age.C3+Gender+LE+Age.C:LE)
summary(Substitution.Model.3)

# Evaluate the significance of the Age.C:LE interaction
# using the General Linear Test framework
GLT.2 <- GLT(Dataset=Substitution,
    Unrestricted.Model=LDST~Age.C+Age.C2+Age.C3+Gender+LE+Age.C:LE,
    Restricted.Model=LDST~Age.C+Age.C2+Age.C3+Gender+LE, Alpha=0.005)
summary(GLT.2)

# Fit the model in which the non-significant Age.C:LE
# interaction term is removed
Substitution.Model.4 <- Stage.1(Dataset=Substitution,
   Alpha=0.005, Model=LDST~Age.C+Age.C2+Age.C3+Gender+LE)
summary(Substitution.Model.4)

# Fit the model in which the non-significant Age.C3 term is removed
Substitution.Model.5 <- Stage.1(Dataset=Substitution,
   Alpha=0.005, Model=LDST~Age.C+Age.C2+Gender+LE)
summary(Substitution.Model.5)

# Fit the model in which the non-significant Age.C2 term is removed
Substitution.Model.6 <- Stage.1(Dataset=Substitution,
   Alpha=0.005, Model=LDST~Age.C+Gender+LE)
summary(Substitution.Model.6)

# Fit the model in which the non-significant main effect of Gender 
# is removed
Substitution.Model.7 <- Stage.1(Dataset=Substitution, 
  Alpha=0.005, Model=LDST~Age.C+LE)
summary(Substitution.Model.7)
plot(Substitution.Model.7, Normality = FALSE, Outliers = FALSE)

# Check the significance of LE using the GLT framework
GLT.3 <- GLT(Dataset=Substitution, Alpha=0.005,
    Unrestricted.Model=LDST~Age.C+LE, 
    Restricted.Model=LDST~Age.C)
summary(GLT.3)

# Substitution.Model.7 uses a cubic polynomial variance 
# prediction function by default. Remove the cubic Pred.Y term 
# from Substitution.Model.7, i.e., fit a quadratic variance 
# prediction function
Substitution.Model.8 <- Stage.1(Dataset=Substitution, 
    Alpha=0.005, Model=LDST~Age.C+LE,
    Order.Poly.Var=2)  # Order.Poly.Var=2 specifies a quadratic polynomial
                       # for the variance prediction function
summary(Substitution.Model.8)
plot(Substitution.Model.8, Normality = FALSE, Outliers = FALSE)

# Remove the quadratic Pred.Y term, i.e., fit a linear variance 
# prediction function
Substitution.Model.9 <- Stage.1(Dataset=Substitution, 
    Alpha=0.005, Model=LDST~Age.C+LE,
    Order.Poly.Var=1) # Order.Poly.Var=1 specifies a linear polynomial
                      # for the variance prediction function

# Final Stage 1 model
summary(Substitution.Model.9)
plot(Substitution.Model.9) 
