as_gen_boot_design {svrep} | R Documentation |
Convert a survey design object to a generalized bootstrap replicate design
Description
Converts a survey design object to a replicate design object with replicate weights formed using the generalized bootstrap method. The generalized survey bootstrap is a method for forming bootstrap replicate weights from a textbook variance estimator, provided that the variance estimator can be represented as a quadratic form whose matrix is positive semidefinite (this covers a large class of variance estimators).
Usage
as_gen_boot_design(
  design,
  variance_estimator = NULL,
  aux_var_names = NULL,
  replicates = 500,
  tau = "auto",
  exact_vcov = FALSE,
  psd_option = "warn",
  mse = getOption("survey.replicates.mse"),
  compress = TRUE
)
Arguments
design |
A survey design object created using the 'survey' (or 'srvyr') package,
with class |
variance_estimator |
The name of the variance estimator whose quadratic form matrix should be created. See variance-estimators for a detailed description of each variance estimator. Options include:
|
aux_var_names |
(Only used if |
replicates |
Number of bootstrap replicates (should be as large as possible, given computer memory/storage limitations). A commonly-recommended default is 500. |
tau |
Either |
exact_vcov |
If |
psd_option |
Either |
mse |
If |
compress |
This reduces the computer memory required to represent the replicate weights and has no impact on estimates. |
Value
A replicate design object, with class svyrep.design, which can be used with the usual functions, such as svymean() or svyglm().
Use weights(..., type = 'analysis') to extract the matrix of replicate weights.
Use as_data_frame_with_weights() to convert the design object to a data frame with columns for the full-sample and replicate weights.
Statistical Details
Let v(\hat{T}_y) be the textbook variance estimator for an estimated population total \hat{T}_y of some variable y.
The base weight for case i in our sample is w_i, and we let \breve{y}_i denote the weighted value w_i y_i.
Suppose we can represent our textbook variance estimator as a quadratic form: v(\hat{T}_y) = \mathbf{\breve{y}}^{\prime} \boldsymbol{\Sigma} \mathbf{\breve{y}}, for some n \times n matrix \boldsymbol{\Sigma}, where \mathbf{\breve{y}} is the length-n column vector of weighted values.
The only constraint on \boldsymbol{\Sigma} is that, for our sample, it must be symmetric and positive semidefinite.
The bootstrapping process creates B sets of replicate weights, where the b-th set is obtained by multiplying the base weights by a length-n vector of adjustment factors denoted \mathbf{a}^{(b)}, whose k-th value is denoted a_k^{(b)}.
This yields B replicate estimates of the population total, \hat{T}_y^{*(b)} = \sum_{k \in s} a_k^{(b)} \breve{y}_k, for b = 1, \ldots, B, which can be used to estimate sampling variance:
v_B\left(\hat{T}_y\right)=\frac{\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2}{B}
This bootstrap variance estimator can be written as a quadratic form:
v_B\left(\hat{T}_y\right) = \mathbf{\breve{y}}^{\prime} \boldsymbol{\Sigma}_B \mathbf{\breve{y}}
where
\boldsymbol{\Sigma}_B = \frac{\sum_{b=1}^B\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)^{\prime}}{B}
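As a quick numerical check of this identity, the following base R sketch computes the bootstrap variance estimate both ways. The weighted values and the adjustment factor matrix are made-up toy inputs chosen only for illustration; they are not produced by this function.

```r
# Hypothetical toy inputs: n = 4 sampled cases, B = 3 replicates
y_breve <- c(10, 20, 30, 40)          # weighted values w_k * y_k
A <- matrix(                          # n x B matrix; column b holds a^(b)
  c(1.2, 0.8, 1.0, 1.0,
    0.9, 1.1, 1.3, 0.7,
    1.0, 1.0, 0.6, 1.4),
  nrow = 4, ncol = 3
)
B <- ncol(A)

T_hat  <- sum(y_breve)                      # full-sample estimate of the total
T_star <- as.vector(crossprod(A, y_breve))  # B replicate estimates, t(A) %*% y_breve

# Direct form: mean squared deviation of the replicate estimates
v_direct <- sum((T_star - T_hat)^2) / B

# Quadratic form: Sigma_B is the average outer product of (a^(b) - 1_n)
D <- A - 1
Sigma_B <- tcrossprod(D) / B                # D %*% t(D) / B
v_quad <- as.numeric(t(y_breve) %*% Sigma_B %*% y_breve)

# The two forms agree up to floating-point error
stopifnot(isTRUE(all.equal(v_direct, v_quad)))
```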
Note that if the vector of adjustment factors \mathbf{a}^{(b)} has expectation \mathbf{1}_n and variance-covariance matrix \boldsymbol{\Sigma}, then we have the bootstrap expectation E_{*}\left( \boldsymbol{\Sigma}_B \right) = \boldsymbol{\Sigma}. Since the bootstrap process takes the sample values \mathbf{\breve{y}} as fixed, the bootstrap expectation of the variance estimator is E_{*} \left( \mathbf{\breve{y}}^{\prime} \boldsymbol{\Sigma}_B \mathbf{\breve{y}} \right) = \mathbf{\breve{y}}^{\prime} \boldsymbol{\Sigma} \mathbf{\breve{y}}.
Thus, we can produce a bootstrap variance estimator with the same expectation as the textbook variance estimator simply by randomly generating \mathbf{a}^{(b)} from a distribution satisfying the following two conditions:
Condition 1: \quad \mathbf{E}_*(\mathbf{a})=\mathbf{1}_n
Condition 2: \quad \mathbf{E}_*\left[\left(\mathbf{a}-\mathbf{1}_n\right)\left(\mathbf{a}-\mathbf{1}_n\right)^{\prime}\right]=\boldsymbol{\Sigma}
While there are multiple ways to generate adjustment factors satisfying these conditions, the simplest general method is to simulate from a multivariate normal distribution: \mathbf{a} \sim MVN(\mathbf{1}_n, \boldsymbol{\Sigma}).
This is the method used by this function.
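The idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the data are a made-up simple random sample without replacement, the quadratic form matrix is written out by hand for that design, and the multivariate normal draws use an eigendecomposition so that a singular \boldsymbol{\Sigma} is handled.

```r
set.seed(1)

# Hypothetical toy sample: n = 5 cases drawn by SRS without replacement from N = 50
y <- c(3, 7, 4, 9, 6)
n <- length(y)
N <- 50
y_breve <- (N / n) * y                 # weighted values w_k * y_k

# Quadratic form matrix of the usual SRS-without-replacement variance
# estimator for a total: (1 - n/N) * n/(n-1) * (I - 11'/n)
Sigma <- (1 - n / N) * (n / (n - 1)) * (diag(n) - matrix(1 / n, n, n))
v_textbook <- as.numeric(t(y_breve) %*% Sigma %*% y_breve)

# Draw B vectors a ~ MVN(1_n, Sigma). Sigma is singular here, so use an
# eigendecomposition rather than a Cholesky factor.
B <- 20000
eig  <- eigen(Sigma, symmetric = TRUE)
root <- eig$vectors %*% diag(sqrt(pmax(eig$values, 0)))  # root %*% t(root) = Sigma
Z <- matrix(rnorm(n * B), nrow = n, ncol = B)
A <- 1 + root %*% Z                    # n x B matrix of adjustment factors

# Bootstrap variance estimate from the replicate totals
T_hat  <- sum(y_breve)
T_star <- as.vector(crossprod(A, y_breve))
v_boot <- sum((T_star - T_hat)^2) / B

c(textbook = v_textbook, bootstrap = v_boot)
```

With many replicates, the bootstrap estimate should be close to the textbook estimate. In this toy setup some simulated factors are negative, which is the situation the rescaling described in the next section addresses.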
Details on Rescaling to Avoid Negative Adjustment Factors
Let \mathbf{A} = \left[ \mathbf{a}^{(1)} \cdots \mathbf{a}^{(b)} \cdots \mathbf{a}^{(B)} \right] denote the (n \times B) matrix of bootstrap adjustment factors.
To eliminate negative adjustment factors, Beaumont and Patak (2012) propose forming a rescaled matrix of nonnegative replicate factors \mathbf{A}^S by rescaling each adjustment factor a_k^{(b)} as follows:
a_k^{S,(b)} = \frac{a_k^{(b)} + \tau - 1}{\tau}
where \tau \geq 1 and \tau \geq 1 - a_k^{(b)} for all k in \left\{ 1,\ldots,n \right\} and all b in \left\{1, \ldots, B\right\}.
The value of \tau can be set based on the realized adjustment factor matrix \mathbf{A} or by choosing \tau prior to generating the adjustment factor matrix \mathbf{A} so that \tau is likely to be large enough to prevent negative bootstrap weights.
If the adjustment factors are rescaled in this manner, it is important to adjust the scale factor used in estimating the variance with the bootstrap replicates, which becomes \frac{\tau^2}{B} instead of \frac{1}{B}.
\textbf{Prior to rescaling: } v_B\left(\hat{T}_y\right) = \frac{1}{B}\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2
\textbf{After rescaling: } v_B\left(\hat{T}_y\right) = \frac{\tau^2}{B}\sum_{b=1}^B\left(\hat{T}_y^{S*(b)}-\hat{T}_y\right)^2
When sharing a dataset that uses rescaled weights from a generalized survey bootstrap, the documentation for the dataset should instruct the user to use replication scale factor \frac{\tau^2}{B} rather than \frac{1}{B}
when estimating sampling variances.
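The effect of the rescaling can be checked numerically. Because a_k^{S,(b)} - 1 = (a_k^{(b)} - 1)/\tau, each rescaled replicate estimate deviates from \hat{T}_y by 1/\tau times the original deviation, so multiplying by \tau^2 restores the original variance estimate. The base R sketch below uses made-up weighted values and simulated factors, and picks the smallest admissible \tau from the realized matrix; this choice is an assumption for illustration and may differ from what tau = "auto" does.

```r
set.seed(2)

# Hypothetical toy inputs
y_breve <- c(30, 70, 40, 90, 60)       # weighted values w_k * y_k
n <- length(y_breve)
B <- 200
A <- matrix(rnorm(n * B, mean = 1, sd = 0.8), nrow = n, ncol = B)

# Smallest tau with tau >= 1 and tau >= 1 - a_k^(b) for every factor
tau <- max(1, 1 - min(A))
A_S <- (A + tau - 1) / tau             # rescaled factors, nonnegative up to rounding

T_hat <- sum(y_breve)

# Variance estimate from the original factors, scale factor 1/B
v_before <- sum((as.vector(crossprod(A, y_breve)) - T_hat)^2) / B

# Variance estimate from the rescaled factors, scale factor tau^2/B
v_after <- (tau^2 / B) * sum((as.vector(crossprod(A_S, y_breve)) - T_hat)^2)

# Both scale factors give the same estimate once tau^2 is applied
stopifnot(isTRUE(all.equal(v_before, v_after)))
```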
Two-Phase Designs
For a two-phase design, variance_estimator should be a list of variance estimators' names, with two elements, such as list('Ultimate Cluster', 'Poisson Horvitz-Thompson').
In two-phase designs, only the following estimators may be used for the second phase:
"Ultimate Cluster"
"Stratified Multistage SRS"
"Poisson Horvitz-Thompson"
For statistical details on the handling of two-phase designs, see the documentation for make_twophase_quad_form.
References
The generalized survey bootstrap was first proposed by Bertail and Combris (1997).
See Beaumont and Patak (2012) for a clear overview of the generalized survey bootstrap.
The generalized survey bootstrap represents one strategy for forming replication variance estimators
in the general framework proposed by Fay (1984) and Dippo, Fay, and Morganstein (1984).
- Ash, S. 2014. “Using Successive Difference Replication for Estimating Variances.” Survey Methodology, Statistics Canada, 40 (1): 47–59.
- Bellhouse, D. R. 1985. “Computing Methods for Variance Estimation in Complex Surveys.” Journal of Official Statistics 1 (3).
- Beaumont, Jean-François, and Zdenek Patak. 2012. “On the Generalized Bootstrap for Sample Surveys with Special Attention to Poisson Sampling: Generalized Bootstrap for Sample Surveys.” International Statistical Review 80 (1): 127–48. https://doi.org/10.1111/j.1751-5823.2011.00166.x.
- Bertail and Combris. 1997. “Bootstrap Généralisé d’un Sondage.” Annales d’Économie Et de Statistique, no. 46: 49. https://doi.org/10.2307/20076068.
- Deville, J.-C., and Y. Tillé. 2005. “Variance Approximation under Balanced Sampling.” Journal of Statistical Planning and Inference 128: 569–91.
- Dippo, Cathryn, Robert Fay, and David Morganstein. 1984. “Computing Variances from Complex Samples with Replicate Weights.” In, 489–94. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/papers/1984_094.pdf.
- Fay, Robert. 1984. “Some Properties of Estimates of Variance Based on Replication Methods.” In, 495–500. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/papers/1984_095.pdf.
- Matei, Alina, and Yves Tillé. 2005. “Evaluation of Variance Approximations and Estimators in Maximum Entropy Sampling with Unequal Probability and Fixed Sample Size.” Journal of Official Statistics 21 (4): 543–70.
See Also
Use estimate_boot_reps_for_target_cv to help choose the number of bootstrap replicates.
For greater customization of the method, make_quad_form_matrix can be used to represent several common variance estimators as a quadratic form's matrix, which can then be used as an input to make_gen_boot_factors.
The function rescale_reps is used to implement the rescaling of the bootstrap adjustment factors.
See variance-estimators for a description of each variance estimator.
Examples
## Not run:
library(survey)
library(svrep)
# Example 1: Bootstrap based on the Yates-Grundy estimator ----
set.seed(2014)
data('election', package = 'survey')
## Create survey design object
pps_design_yg <- svydesign(
  data = election_pps,
  id = ~1, fpc = ~p,
  pps = ppsmat(election_jointprob),
  variance = "YG"
)
## Convert to generalized bootstrap replicate design
gen_boot_design_yg <- pps_design_yg |>
  as_gen_boot_design(variance_estimator = "Yates-Grundy",
                     replicates = 1000, tau = "auto")
## Compare estimates from the original design and the bootstrap design
svytotal(x = ~ Bush + Kerry, design = pps_design_yg)
svytotal(x = ~ Bush + Kerry, design = gen_boot_design_yg)
# Example 2: Bootstrap based on the successive-difference estimator ----
data('library_stsys_sample', package = 'svrep')
## First, ensure data are sorted in same order as was used in sampling
library_stsys_sample <- library_stsys_sample[
  order(library_stsys_sample$SAMPLING_SORT_ORDER),
]
## Create a survey design object
design_obj <- svydesign(
  data = library_stsys_sample,
  strata = ~ SAMPLING_STRATUM,
  ids = ~ 1,
  fpc = ~ STRATUM_POP_SIZE
)
## Convert to generalized bootstrap replicate design
gen_boot_design_sd2 <- as_gen_boot_design(
  design = design_obj,
  variance_estimator = "SD2",
  replicates = 2000
)
## Estimate sampling variances
svytotal(x = ~ TOTSTAFF, na.rm = TRUE, design = gen_boot_design_sd2)
svytotal(x = ~ TOTSTAFF, na.rm = TRUE, design = design_obj)
# Example 3: Two-phase sample ----
# -- First phase is stratified systematic sampling,
# -- second phase is response/nonresponse modeled as Poisson sampling
## Estimate response propensities with an intercept-only model
nonresponse_model <- glm(
  data = library_stsys_sample,
  family = quasibinomial('logit'),
  formula = I(RESPONSE_STATUS == "Survey Respondent") ~ 1,
  weights = 1/library_stsys_sample$SAMPLING_PROB
)
library_stsys_sample[['RESPONSE_PROPENSITY']] <- predict(
  nonresponse_model,
  newdata = library_stsys_sample,
  type = "response"
)
## Create a two-phase design object
twophase_design <- twophase(
  data = library_stsys_sample,
  # Identify cases included in second phase sample
  subset = ~ I(RESPONSE_STATUS == "Survey Respondent"),
  strata = list(~ SAMPLING_STRATUM, NULL),
  id = list(~ 1, ~ 1),
  probs = list(NULL, ~ RESPONSE_PROPENSITY),
  fpc = list(~ STRATUM_POP_SIZE, NULL)
)
## Convert to generalized bootstrap replicate design,
## with one variance estimator per phase
twophase_boot_design <- as_gen_boot_design(
  design = twophase_design,
  variance_estimator = list(
    "SD2", "Poisson Horvitz-Thompson"
  )
)
svytotal(x = ~ LIBRARIA, design = twophase_boot_design)
## End(Not run)