predict_outcome_via_glm {nrba} | R Documentation |
Fit a regression model to predict survey outcomes
Description
A regression model is fit to the sample data to
predict outcomes measured by a survey.
This model can be used to identify auxiliary variables that are
predictive of survey outcomes and hence are potentially useful
for nonresponse bias analysis or weighting adjustments.
Only data from survey respondents will be used to fit the model,
since survey outcomes are only measured among respondents.
The function returns a summary of the model, including overall tests
for each variable of whether that variable improves the model's
ability to predict the survey outcome in the population of interest (not just in the random sample at hand).
Usage
predict_outcome_via_glm(
survey_design,
outcome_variable,
outcome_type = "continuous",
outcome_to_predict = NULL,
numeric_predictors = NULL,
categorical_predictors = NULL,
model_selection = "main-effects",
selection_controls = list(alpha_enter = 0.5, alpha_remain = 0.5, max_iterations = 100L)
)
Arguments
survey_design |
A survey design object created with the |
outcome_variable |
Name of an outcome variable to use as the dependent variable in the model.
The value of this variable is expected to be |
outcome_type |
Either |
outcome_to_predict |
Only required if |
numeric_predictors |
A list of names of numeric auxiliary variables to use for predicting the survey outcome. |
categorical_predictors |
A list of names of categorical auxiliary variables to use for predicting the survey outcome. |
model_selection |
A character string specifying how to select a model.
The default and recommended method is 'main-effects', which simply includes main effects
for each of the predictor variables. |
selection_controls |
Only required if |
Details
See Lumley and Scott (2017) for details of how regression models are fit to survey data.
For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted
(see section 4 of Lumley and Scott (2017) for statistical details)
using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint")
from the 'survey' package.
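As a rough illustration of the kind of test described above, the following sketch calls regTermTest() directly on a model fit with svyglm(). It uses the api example data shipped with the 'survey' package rather than nrba data, and the object names (example_design, example_model, lrt_result) and the model formula are chosen purely for illustration. It is not necessarily how predict_outcome_via_glm() sets up the call internally.

```r
library(survey)

# Example data shipped with the 'survey' package (not part of nrba)
data(api, package = "survey")

# Two-stage cluster sample design, following the 'survey' package examples
example_design <- svydesign(
  id = ~ dnum + snum,
  fpc = ~ fpc1 + fpc2,
  data = apiclus2
)

# Design-based linear regression with one numeric and one categorical predictor
example_model <- svyglm(api00 ~ meals + stype, design = example_design)

# Rao-Scott likelihood ratio test of all coefficients belonging to 'stype'
lrt_result <- regTermTest(
  example_model, ~stype,
  method = "LRT",
  lrt.approximation = "saddlepoint"
)
print(lrt_result)
```

Because stype is categorical, the test covers all of its coefficients jointly, which is the sense in which the test is "overall" for a variable.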
If the user specifies model_selection = "stepwise", a regression model
is selected by adding and removing variables based on the p-value from a
likelihood ratio test. At each stage, a single variable is added to the model if
the p-value of the likelihood ratio test from adding the variable is below alpha_enter
and its p-value is less than that of all other variables not already in the model.
Next, of the variables already in the model, the variable with the largest p-value
is dropped if its p-value is greater than alpha_remain. This iterative process
continues until a maximum number of iterations is reached or until
either all variables have been added to the model or there are no unadded variables
for which the likelihood ratio test has a p-value below alpha_enter.
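The add-then-drop loop described above can be sketched in base R. This is only an illustration of the control flow: it substitutes ordinary (unweighted) Gaussian GLMs and chi-squared likelihood ratio tests from anova() for the survey-weighted models and Rao-Scott tests that the package uses, and the function name stepwise_sketch and its arguments are hypothetical rather than part of nrba.

```r
# Illustrative only: unweighted GLMs and ordinary likelihood ratio tests
# stand in for the survey-weighted Rao-Scott tests used by the package.
stepwise_sketch <- function(data, outcome, candidates,
                            alpha_enter = 0.5, alpha_remain = 0.5,
                            max_iterations = 100L) {
  # Fit a Gaussian GLM of the outcome on the given predictors
  fit <- function(vars) {
    rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
    glm(as.formula(paste(outcome, "~", rhs)), data = data)
  }
  # p-value of the likelihood ratio test comparing two nested models
  lrt_p <- function(smaller, larger) {
    anova(smaller, larger, test = "LRT")[2, "Pr(>Chi)"]
  }

  selected <- character(0)
  for (i in seq_len(max_iterations)) {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break

    # Entry step: add the unadded variable with the smallest p-value,
    # provided that p-value is below alpha_enter
    current <- fit(selected)
    p_enter <- vapply(
      remaining,
      function(v) lrt_p(current, fit(c(selected, v))),
      numeric(1)
    )
    if (min(p_enter) >= alpha_enter) break
    selected <- c(selected, names(which.min(p_enter)))

    # Removal step: drop the variable with the largest p-value
    # if that p-value exceeds alpha_remain
    current <- fit(selected)
    p_remain <- vapply(
      selected,
      function(v) lrt_p(fit(setdiff(selected, v)), current),
      numeric(1)
    )
    if (max(p_remain) > alpha_remain) {
      selected <- setdiff(selected, names(which.max(p_remain)))
    }
  }
  selected
}

# Try the sketch on a built-in dataset
stepwise_sketch(mtcars, "mpg", c("wt", "hp", "qsec"))
```

The max_iterations cap matters because a variable can in principle be added and later dropped repeatedly; the cap guarantees that the loop terminates.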
Value
A data frame summarizing the fitted regression model.
Each row in the data frame represents a coefficient in the model.
The column variable describes the underlying variable
for the coefficient. For categorical variables, the column variable_category
indicates the particular category of that variable for which a coefficient is estimated.
The columns estimated_coefficient, se_coefficient,
conf_intrvl_lower, conf_intrvl_upper, and p_value_coefficient
are summary statistics for the estimated coefficient. Note that p_value_coefficient
is based on the Wald t-test for the coefficient.
The column variable_level_p_value gives the p-value of the
Rao-Scott Likelihood Ratio Test for including the variable in the model.
This likelihood ratio test has its test statistic given by the column
LRT_chisq_statistic, and the reference distribution
for this test is a linear combination of p F-distributions
with numerator degrees of freedom given by LRT_df_numerator
and denominator degrees of freedom given by LRT_df_denominator,
where p is the number of coefficients in the model corresponding to
the variable being tested.
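Since the variable-level test is repeated on every coefficient row of a variable, it can be convenient to collapse the result to one row per variable. The following is a minimal sketch, assuming the column names described above and the example data shipped with 'nrba'; the object names model_summary and variable_tests are illustrative.

```r
library(survey)
library(nrba)

# Example data and design, as in the Examples section
data(involvement_survey_str2s, package = "nrba")
survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  id = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  data = involvement_survey_str2s
)

model_summary <- predict_outcome_via_glm(
  survey_design = survey_design,
  outcome_variable = "WHETHER_PARENT_AGREES",
  outcome_type = "binary",
  outcome_to_predict = "AGREE",
  numeric_predictors = c("STUDENT_AGE"),
  categorical_predictors = c("STUDENT_DISABILITY_CATEGORY", "PARENT_HAS_EMAIL")
)

# Keep one row per variable with its overall (variable-level) p-value,
# then sort so that the most predictive variables come first
variable_tests <- unique(
  model_summary[, c("variable", "variable_level_p_value")]
)
variable_tests[order(variable_tests$variable_level_p_value), ]
```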
References
Lumley, T., & Scott, A. (2017). Fitting Regression Models to Survey Data. Statistical Science, 32(2), 265-278. https://doi.org/10.1214/16-STS605
Examples
library(survey)
library(nrba)

# Create a survey design ----
data(involvement_survey_str2s, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  id = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  data = involvement_survey_str2s
)

# Fit a main-effects model predicting whether a parent agrees ----
predict_outcome_via_glm(
  survey_design = survey_design,
  outcome_variable = "WHETHER_PARENT_AGREES",
  outcome_type = "binary",
  outcome_to_predict = "AGREE",
  model_selection = "main-effects",
  numeric_predictors = c("STUDENT_AGE"),
  categorical_predictors = c("STUDENT_DISABILITY_CATEGORY", "PARENT_HAS_EMAIL")
)