R: Fit a regression model to predict survey outcomes

predict_outcome_via_glm {nrba}

R Documentation

Fit a regression model to predict survey outcomes

Description

A regression model is fit to the sample data to predict outcomes measured by a survey. This model can be used to identify auxiliary variables that are predictive of survey outcomes and hence are potentially useful for nonresponse bias analysis or weighting adjustments.

Only data from survey respondents will be used to fit the model, since survey outcomes are only measured among respondents.

The function returns a summary of the model, including an overall test for each variable of whether that variable improves the model's ability to predict the survey outcome in the population of interest (not just in the random sample at hand).

Usage

predict_outcome_via_glm(
  survey_design,
  outcome_variable,
  outcome_type = "continuous",
  outcome_to_predict = NULL,
  numeric_predictors = NULL,
  categorical_predictors = NULL,
  model_selection = "main-effects",
  selection_controls = list(alpha_enter = 0.5, alpha_remain = 0.5, max_iterations = 100L)
)

Arguments

`survey_design`	A survey design object created with the `survey` package.
`outcome_variable`	Name of an outcome variable to use as the dependent variable in the model. The value of this variable is expected to be `NA` (i.e. missing) for all cases other than eligible respondents.
`outcome_type`	Either `"binary"` or `"continuous"`. For `"binary"`, a logistic regression model is used. For `"continuous"`, a generalized linear model is fit using an identity link function.
`outcome_to_predict`	Only required if `outcome_type="binary"`. Specify which category of `outcome_variable` is to be predicted.
`numeric_predictors`	A list of names of numeric auxiliary variables to use for predicting the outcome.
`categorical_predictors`	A list of names of categorical auxiliary variables to use for predicting the outcome.
`model_selection`	A character string specifying how to select a model. The default and recommended method is `"main-effects"`, which simply includes main effects for each of the predictor variables. The method `"stepwise"` can be used to perform stepwise selection of variables for the model. However, stepwise selection invalidates p-values, standard errors, and confidence intervals, which are generally calculated under the assumption that model specification is predetermined.
`selection_controls`	Only used if `model_selection` isn't set to `"main-effects"`. In that case, a list of parameters for model selection to pass on to `stepwise_model_selection`, with elements `alpha_enter`, `alpha_remain`, and `max_iterations`.
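
As a rough illustration of how `model_selection` and `selection_controls` fit together, the sketch below requests stepwise selection on the example data shipped with the package. The threshold values are simply the defaults shown under Usage, written out explicitly; they are not a recommendation.

```r
library(nrba)
library(survey)

# Example data from the 'nrba' package
data(involvement_survey_str2s, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  id = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  data = involvement_survey_str2s
)

# Stepwise selection, with the documented default controls spelled out
stepwise_summary <- predict_outcome_via_glm(
  survey_design = survey_design,
  outcome_variable = "WHETHER_PARENT_AGREES",
  outcome_type = "binary",
  outcome_to_predict = "AGREE",
  numeric_predictors = c("STUDENT_AGE"),
  categorical_predictors = c("STUDENT_DISABILITY_CATEGORY", "PARENT_HAS_EMAIL"),
  model_selection = "stepwise",
  selection_controls = list(
    alpha_enter = 0.5,
    alpha_remain = 0.5,
    max_iterations = 100L
  )
)
```

As noted for `model_selection`, p-values and confidence intervals reported after stepwise selection should be read with caution.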

Details

See Lumley and Scott (2017) for details of how regression models are fit to survey data. For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted (see section 4 of Lumley and Scott (2017) for statistical details) using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint") from the 'survey' package.
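
The variable-level test described above can be tried directly with the 'survey' package. The following is a minimal sketch, not the internal code of this function: it assumes the `api` example data that ships with 'survey' (not part of 'nrba') and uses `quasibinomial()` as the family for a binary outcome, which is an illustrative choice.

```r
library(survey)

# Example data shipped with the 'survey' package (an assumption for this sketch)
data(api, package = "survey")

# Two-stage cluster sample: districts (dnum), then schools (snum)
cluster_design <- svydesign(id = ~ dnum + snum, weights = ~pw, data = apiclus2)

# Design-based logistic regression for a binary outcome
fitted_model <- svyglm(
  formula = I(sch.wide == "Yes") ~ ell + meals + mobility,
  design = cluster_design,
  family = quasibinomial()
)

# Rao-Scott likelihood ratio test of whether 'ell' improves the model
lrt_result <- regTermTest(
  fitted_model, ~ell,
  method = "LRT",
  lrt.approximation = "saddlepoint"
)
print(lrt_result)
```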

If the user specifies model_selection = "stepwise", a regression model is selected by adding and removing variables based on the p-value from a likelihood ratio test. At each stage, a single variable is added to the model if the p-value of the likelihood ratio test from adding the variable is below alpha_enter and its p-value is less than that of all other variables not already in the model. Next, of the variables already in the model, the variable with the largest p-value is dropped if its p-value is greater than alpha_remain. This iterative process continues until a maximum number of iterations is reached or until either all variables have been added to the model or there are no unadded variables for which the likelihood ratio test has a p-value below alpha_enter.
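
The add-then-drop loop described above can be sketched roughly as follows. This is an illustration of the procedure under stated assumptions, not the package's `stepwise_model_selection` code: the helper names `term_p_value` and `stepwise_sketch` are hypothetical, the data are the `api` example data from the 'survey' package, and the outcome is treated as continuous.

```r
library(survey)

# Hypothetical helper: Rao-Scott LRT p-value for one term in a given model
term_p_value <- function(design, outcome, terms_in_model, term) {
  model_formula <- reformulate(terms_in_model, response = outcome)
  fit <- svyglm(formula = model_formula, design = design)
  test <- regTermTest(fit, reformulate(term),
                      method = "LRT", lrt.approximation = "saddlepoint")
  as.numeric(test$p)
}

# Hypothetical sketch of the stepwise loop described in the text
stepwise_sketch <- function(design, outcome, candidates,
                            alpha_enter = 0.5, alpha_remain = 0.5,
                            max_iterations = 100L) {
  selected <- character(0)
  for (iteration in seq_len(max_iterations)) {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break  # every variable is already in the model

    # Forward step: add the unadded variable with the smallest p-value,
    # provided that p-value is below alpha_enter
    enter_p <- vapply(remaining, function(term) {
      term_p_value(design, outcome, c(selected, term), term)
    }, numeric(1))
    if (min(enter_p) >= alpha_enter) break  # nothing qualifies to enter
    selected <- c(selected, remaining[which.min(enter_p)])

    # Backward step: drop the variable with the largest p-value
    # if that p-value exceeds alpha_remain
    remain_p <- vapply(selected, function(term) {
      term_p_value(design, outcome, selected, term)
    }, numeric(1))
    if (max(remain_p) > alpha_remain) {
      selected <- selected[-which.max(remain_p)]
    }
  }
  selected
}

# Usage on the 'survey' example data (an assumption for this sketch)
data(api, package = "survey")
cluster_design <- svydesign(id = ~ dnum + snum, weights = ~pw, data = apiclus2)
candidates <- c("ell", "meals", "mobility")
selected_terms <- stepwise_sketch(cluster_design, "api00", candidates)
print(selected_terms)
```

The `max_iterations` bound matters because a variable can in principle be added and dropped repeatedly; the loop stops at that limit even if neither stopping condition has been met.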

Value

A data frame summarizing the fitted regression model.

Each row in the data frame represents a coefficient in the model. The column variable describes the underlying variable for the coefficient. For categorical variables, the column variable_category indicates the particular category of that variable for which a coefficient is estimated.

The columns estimated_coefficient, se_coefficient, conf_intrvl_lower, conf_intrvl_upper, and p_value_coefficient are summary statistics for the estimated coefficient. Note that p_value_coefficient is based on the Wald t-test for the coefficient.

The column variable_level_p_value gives the p-value of the Rao-Scott Likelihood Ratio Test for including the variable in the model. This likelihood ratio test has its test statistic given by the column LRT_chisq_statistic, and the reference distribution for this test is a linear combination of p F-distributions with numerator degrees of freedom given by LRT_df_numerator and denominator degrees of freedom given by LRT_df_denominator, where p is the number of coefficients in the model corresponding to the variable being tested.
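
To make the layout of the returned data frame concrete, the sketch below fits the same model as in the Examples section and then separates the coefficient-level columns from the variable-level test columns. The column names are those listed above; the object names `model_summary`, `coefficient_columns`, and `variable_tests` are illustrative.

```r
library(nrba)
library(survey)

data(involvement_survey_str2s, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  id = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  data = involvement_survey_str2s
)

model_summary <- predict_outcome_via_glm(
  survey_design = survey_design,
  outcome_variable = "WHETHER_PARENT_AGREES",
  outcome_type = "binary",
  outcome_to_predict = "AGREE",
  model_selection = "main-effects",
  numeric_predictors = c("STUDENT_AGE"),
  categorical_predictors = c("STUDENT_DISABILITY_CATEGORY", "PARENT_HAS_EMAIL")
)

# Coefficient-level summary: one row per estimated coefficient
coefficient_columns <- c(
  "variable", "variable_category",
  "estimated_coefficient", "se_coefficient",
  "conf_intrvl_lower", "conf_intrvl_upper", "p_value_coefficient"
)
print(model_summary[, coefficient_columns])

# Variable-level Rao-Scott tests: values repeat across a variable's rows,
# so keep one row per variable
test_columns <- c(
  "variable", "variable_level_p_value", "LRT_chisq_statistic",
  "LRT_df_numerator", "LRT_df_denominator"
)
variable_tests <- unique(model_summary[, test_columns])
print(variable_tests)
```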

References

Lumley, T., & Scott, A. (2017). Fitting Regression Models to Survey Data. Statistical Science, 32(2), 265-278. https://doi.org/10.1214/16-STS605

Examples

library(nrba)
library(survey)

# Create a survey design ----
data(involvement_survey_str2s, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  id = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  data = involvement_survey_str2s
)

predict_outcome_via_glm(
  survey_design = survey_design,
  outcome_variable = "WHETHER_PARENT_AGREES",
  outcome_type = "binary",
  outcome_to_predict = "AGREE",
  model_selection = "main-effects",
  numeric_predictors = c("STUDENT_AGE"),
  categorical_predictors = c("STUDENT_DISABILITY_CATEGORY", "PARENT_HAS_EMAIL")
)

[Package nrba version 0.3.1 Index]