stepwise_model_selection {nrba} | R Documentation |
Select and fit a model using stepwise regression
Description
A regression model is selected by iteratively adding and removing variables based on the p-value from a
likelihood ratio test. At each stage, a single variable is added to the model if
the p-value of the likelihood ratio test from adding the variable is below alpha_enter
and its p-value is less than that of all other variables not already in the model.
Next, of the variables already in the model, the variable with the largest p-value
is dropped if its p-value is greater than alpha_remain. This iterative process
continues until a maximum number of iterations is reached or until
either all variables have been added to the model or there are no variables
not yet in the model whose likelihood ratio test has a p-value below alpha_enter.
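The add-and-drop loop described above can be sketched in base R. This is only a simplified illustration, not the package's implementation: it assumes an unweighted binary logistic model fit with glm() and an ordinary likelihood ratio test in place of the survey-weighted model and Rao-Scott test. The helper names lrt_p_value() and stepwise_sketch() and the simulated data are hypothetical.

```r
# Simplified, unweighted illustration of the stepwise add/drop loop.
# Assumption: glm() and a standard LRT stand in for the survey-based methods.

# p-value of the LRT comparing the model with and without `term`
lrt_p_value <- function(data, outcome, terms_in_model, term) {
  reduced_terms <- setdiff(terms_in_model, term)
  if (length(reduced_terms) == 0) reduced_terms <- "1" # intercept-only model
  full <- glm(reformulate(terms_in_model, response = outcome),
              data = data, family = binomial())
  reduced <- glm(reformulate(reduced_terms, response = outcome),
                 data = data, family = binomial())
  pchisq(
    deviance(reduced) - deviance(full),
    df = df.residual(reduced) - df.residual(full),
    lower.tail = FALSE
  )
}

stepwise_sketch <- function(data, outcome, candidates,
                            alpha_enter = 0.5, alpha_remain = 0.5,
                            max_iterations = 100L) {
  selected <- character(0)
  for (iteration in seq_len(max_iterations)) {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break # every candidate is already in the model

    # Entry step: add the candidate with the smallest p-value, if below alpha_enter
    enter_p <- vapply(
      remaining,
      function(v) lrt_p_value(data, outcome, c(selected, v), v),
      numeric(1)
    )
    if (min(enter_p) >= alpha_enter) break # no candidate qualifies to enter
    selected <- c(selected, names(which.min(enter_p)))

    # Removal step: drop the variable with the largest p-value, if above alpha_remain
    remain_p <- vapply(
      selected,
      function(v) lrt_p_value(data, outcome, selected, v),
      numeric(1)
    )
    if (max(remain_p) > alpha_remain) {
      selected <- setdiff(selected, names(which.max(remain_p)))
    }
  }
  selected
}

# Simulated data in which only x1 is related to the outcome
set.seed(2024)
n <- 500
sim_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim_data$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim_data$x1))

selected_terms <- stepwise_sketch(sim_data, "y", c("x1", "x2", "x3"))
print(selected_terms)
```

With thresholds as permissive as 0.5, unrelated variables such as x2 or x3 may also be retained, which is consistent with using the procedure only to shorten a list of candidates.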
Stepwise model selection generally invalidates inferential statistics
such as p-values, standard errors, or confidence intervals and leads to
overestimation of the size of coefficients for variables included in the selected model.
This bias increases as the value of alpha_enter or alpha_remain decreases.
The use of stepwise model selection should be limited to
reducing a large list of candidate variables for nonresponse adjustment.
Usage
stepwise_model_selection(
  survey_design,
  outcome_variable,
  predictor_variables,
  model_type = "binary-logistic",
  max_iterations = 100L,
  alpha_enter = 0.5,
  alpha_remain = 0.5
)
Arguments
survey_design |
A survey design object created with the |
outcome_variable |
The name of an outcome variable to use as the dependent variable. |
predictor_variables |
A list of names of variables to consider as predictors for the model. |
model_type |
A character string describing the type of model to fit. |
max_iterations |
Maximum number of iterations to try adding new variables to the model. |
alpha_enter |
The maximum p-value allowed for a variable to be added to the model. Large values such as 0.5 or greater are recommended to reduce the bias of estimates from the selected model. |
alpha_remain |
The maximum p-value allowed for a variable to remain in the model. Large values such as 0.5 or greater are recommended to reduce the bias of estimates from the selected model. |
Details
See Lumley and Scott (2017) for details of how regression models are fit to survey data.
For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted
(see section 4 of Lumley and Scott (2017) for statistical details)
using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint")
from the 'survey' package.
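As a rough illustration of the test named above, the following sketch calls regTermTest() directly on a survey-weighted logistic model. It assumes the 'survey' package is installed and uses that package's own apiclus2 example data rather than the nrba data; the object names two_stage_design, logistic_model, and meals_test are illustrative only, and the call is not meant to reproduce how this function invokes the test internally.

```r
library(survey)

# Example data shipped with the 'survey' package (assumption: not the nrba data)
data(api, package = "survey")

# Two-stage cluster sample with finite population corrections
two_stage_design <- svydesign(
  ids = ~ dnum + snum,
  fpc = ~ fpc1 + fpc2,
  data = apiclus2
)

# Survey-weighted binary logistic regression
logistic_model <- svyglm(
  I(sch.wide == "Yes") ~ ell + meals + mobility,
  design = two_stage_design,
  family = quasibinomial()
)

# Rao-Scott likelihood ratio test for the 'meals' term
meals_test <- regTermTest(
  logistic_model, ~ meals,
  method = "LRT",
  lrt.approximation = "saddlepoint"
)

# The p-value is the quantity compared against alpha_enter or alpha_remain
print(meals_test$p)
```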
See Sauerbrei et al. (2020) for a discussion of statistical issues with using stepwise model selection.
Value
An object of class svyglm representing
a regression model fit using the 'survey' package.
References
Lumley, T., & Scott, A. (2017). Fitting Regression Models to Survey Data. Statistical Science 32 (2), 265-278. https://doi.org/10.1214/16-STS605
Sauerbrei, W., Perperoglou, A., Schmid, M. et al. (2020). State of the art in selection of variables and functional forms in multivariable analysis - outstanding issues. Diagnostic and Prognostic Research 4, 3. https://doi.org/10.1186/s41512-020-00074-3
Examples
library(survey)
library(nrba)

# Load example data and prepare it for analysis
data(involvement_survey_str2s, package = 'nrba')

# Describe the stratified two-stage design: schools, then students within schools
involvement_survey <- svydesign(
  data = involvement_survey_str2s,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  strata = ~ SCHOOL_DISTRICT,
  weights = ~ BASE_WEIGHT
)

# Convert the outcome to a factor
involvement_survey <- involvement_survey |>
  transform(WHETHER_PARENT_AGREES = factor(WHETHER_PARENT_AGREES))

# Fit a regression model using stepwise selection
selected_model <- stepwise_model_selection(
  survey_design = involvement_survey,
  outcome_variable = "WHETHER_PARENT_AGREES",
  predictor_variables = c("STUDENT_RACE", "STUDENT_DISABILITY_CATEGORY"),
  model_type = "binary-logistic",
  max_iterations = 100,
  alpha_enter = 0.5,
  alpha_remain = 0.5
)