R: Select and fit a model using stepwise regression

stepwise_model_selection {nrba}

R Documentation

Select and fit a model using stepwise regression

Description

A regression model is selected by iteratively adding and removing variables based on the p-value from a likelihood ratio test. At each stage, a single variable is added to the model if the p-value of the likelihood ratio test from adding the variable is below alpha_enter and its p-value is less than that of all other variables not already in the model. Next, of the variables already in the model, the variable with the largest p-value is dropped if its p-value is greater than alpha_remain. This iterative process continues until a maximum number of iterations is reached, until all variables have been added to the model, or until no variable outside the model has a likelihood ratio test p-value below alpha_enter.

Stepwise model selection generally invalidates inferential statistics such as p-values, standard errors, or confidence intervals and leads to overestimation of the size of coefficients for variables included in the selected model. This bias increases as the value of alpha_enter or alpha_remain decreases. The use of stepwise model selection should be limited only to reducing a large list of candidate variables for nonresponse adjustment.
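
The selection procedure described above can be sketched in base R. This is only an illustrative sketch, not the implementation used by this package: it assumes an ordinary (non-survey) binomial glm and an ordinary likelihood ratio test instead of the Rao-Scott test, and the helper names fit_glm, lrt_p_value, and stepwise_sketch are hypothetical. It uses the infert dataset that ships with R.

```r
# Hypothetical helper: fit a binomial GLM with the given predictors
# (intercept-only when no predictors have been selected yet).
fit_glm <- function(data, outcome, predictors) {
  rhs <- if (length(predictors) == 0) "1" else paste(predictors, collapse = " + ")
  glm(as.formula(paste(outcome, "~", rhs)), data = data, family = binomial())
}

# Hypothetical helper: p-value of an ordinary likelihood ratio test
# comparing two nested models (NOT the Rao-Scott test for survey data).
lrt_p_value <- function(smaller, larger) {
  stat <- deviance(smaller) - deviance(larger)
  df <- df.residual(smaller) - df.residual(larger)
  pchisq(stat, df = df, lower.tail = FALSE)
}

stepwise_sketch <- function(data, outcome, candidates,
                            alpha_enter = 0.5, alpha_remain = 0.5,
                            max_iterations = 100L) {
  selected <- character(0)
  for (i in seq_len(max_iterations)) {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break  # every candidate is in the model

    # Entry step: test each remaining variable against the current model
    current <- fit_glm(data, outcome, selected)
    p_enter <- vapply(remaining, function(v) {
      lrt_p_value(current, fit_glm(data, outcome, c(selected, v)))
    }, numeric(1))
    if (min(p_enter) >= alpha_enter) break  # nothing qualifies to enter
    selected <- c(selected, names(which.min(p_enter)))

    # Removal step: drop the variable with the largest p-value,
    # but only if that p-value exceeds alpha_remain
    current <- fit_glm(data, outcome, selected)
    p_remain <- vapply(selected, function(v) {
      lrt_p_value(fit_glm(data, outcome, setdiff(selected, v)), current)
    }, numeric(1))
    if (max(p_remain) > alpha_remain) {
      selected <- setdiff(selected, names(which.max(p_remain)))
    }
  }
  fit_glm(data, outcome, selected)
}

# Usage on a dataset shipped with R
sketch_model <- stepwise_sketch(
  data = infert,
  outcome = "case",
  candidates = c("age", "parity", "induced", "spontaneous")
)
formula(sketch_model)
```

The max_iterations bound guards against a variable being added and dropped repeatedly, which can happen when alpha_remain is smaller than alpha_enter.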

Usage

stepwise_model_selection(
  survey_design,
  outcome_variable,
  predictor_variables,
  model_type = "binary-logistic",
  max_iterations = 100L,
  alpha_enter = 0.5,
  alpha_remain = 0.5
)

Arguments

`survey_design`	A survey design object created with the `survey` package.
`outcome_variable`	The name of an outcome variable to use as the dependent variable.
`predictor_variables`	A list of names of variables to consider as predictors for the model.
`model_type`	A character string describing the type of model to fit. `'binary-logistic'` for a binary logistic regression, `'ordinal-logistic'` for an ordinal logistic regression (cumulative proportional-odds), or `'normal'` for a linear regression model, which assumes residuals follow a Normal distribution.
`max_iterations`	Maximum number of iterations to try adding new variables to the model.
`alpha_enter`	The maximum p-value allowed for a variable to be added to the model. Large values such as 0.5 or greater are recommended to reduce the bias of estimates from the selected model.
`alpha_remain`	The maximum p-value allowed for a variable to remain in the model. Large values such as 0.5 or greater are recommended to reduce the bias of estimates from the selected model.

Details

See Lumley and Scott (2017) for details of how regression models are fit to survey data. For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted (see section 4 of Lumley and Scott (2017) for statistical details) using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint") from the 'survey' package.
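
As a rough illustration of the test named above, the call below applies regTermTest() to a single term of a survey-weighted model. It is a sketch under assumptions: it uses the api example data from the 'survey' package rather than this package's data, and the choice of model and tested term is arbitrary.

```r
library(survey)

# Example data shipped with the 'survey' package (not the nrba example data)
data(api, package = "survey")

# Two-stage cluster sample with finite population corrections
api_design <- svydesign(
  ids = ~ dnum + snum,
  fpc = ~ fpc1 + fpc2,
  data = apiclus2
)

# Survey-weighted linear regression
api_model <- svyglm(api00 ~ ell + meals + mobility, design = api_design)

# Rao-Scott likelihood ratio test for dropping 'mobility' from the model
lrt_result <- regTermTest(
  api_model, ~ mobility,
  method = "LRT",
  lrt.approximation = "saddlepoint"
)

# The p-value is the quantity compared against alpha_enter or alpha_remain
lrt_result$p
```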

See Sauerbrei et al. (2020) for a discussion of statistical issues with using stepwise model selection.

Value

An object of class svyglm representing a regression model fit using the 'survey' package.

References

Lumley, T., & Scott, A. (2017). Fitting Regression Models to Survey Data. Statistical Science, 32(2), 265-278. https://doi.org/10.1214/16-STS605
Sauerbrei, W., Perperoglou, A., Schmid, M. et al. (2020). State of the art in selection of variables and functional forms in multivariable analysis - outstanding issues. Diagnostic and Prognostic Research 4, 3. https://doi.org/10.1186/s41512-020-00074-3

Examples

library(survey)

# Load example data and prepare it for analysis
data(involvement_survey_str2s, package = 'nrba')

involvement_survey <- svydesign(
  data = involvement_survey_str2s,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  strata = ~ SCHOOL_DISTRICT,
  weights = ~ BASE_WEIGHT
)

involvement_survey <- involvement_survey |>
    transform(WHETHER_PARENT_AGREES = factor(WHETHER_PARENT_AGREES))

# Fit a regression model using stepwise selection
selected_model <- stepwise_model_selection(
  survey_design = involvement_survey,
  outcome_variable = "WHETHER_PARENT_AGREES",
  predictor_variables = c("STUDENT_RACE", "STUDENT_DISABILITY_CATEGORY"),
  model_type = "binary-logistic",
  max_iterations = 100,
  alpha_enter = 0.5,
  alpha_remain = 0.5
)

[Package nrba version 0.3.1 Index]