This vignette provides guidance on when the bootstrap can be used for complex survey data and how to choose a bootstrap method and the number of bootstrap replicates for a survey. It also demonstrates how to use functions from the “svrep” package to implement bootstrap methods and to help decide how many bootstrap replicates your data require. The functions covered in this vignette include:
Functions for Creating Bootstrap Weights:

- as_bootstrap_design(), which converts a survey design object to a replicate survey design object with bootstrap replicate weights.
- make_rwyb_bootstrap_weights(), which creates bootstrap replicate weights for general survey designs using the method of Rao-Wu-Yue-Beaumont. The survey designs may potentially include stratification, multistage clustering, and sampling with or without replacement with unequal probabilities.

Functions for Creating Generalized Survey Bootstrap Weights:

- as_gen_boot_design(), which creates a replicate survey design using the generalized survey bootstrap method as described by Beaumont and Patak (2012).
- make_gen_boot_factors(), which creates a matrix of replicate weights using the generalized survey bootstrap method. This function will usually be used in conjunction with the function make_quad_form_matrix(), which returns the quadratic form matrix representation of common variance estimators.

Functions to Help Choose the Number of Bootstrap Replicates:

- estimate_boot_sim_cv(), which estimates the simulation error of a bootstrap estimate.
- estimate_boot_reps_for_target_cv(), which estimates the number of bootstrap replicates needed to reduce a bootstrap’s simulation error to a target level.
Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.
-- John Tukey
Ok, but what if the approximate answer is, like, really approximate or requires a whole lot of computing?
-- Survey sampling statisticians
Survey bootstrap methods are directly applicable to a wider variety of sample designs than the jackknife or balanced repeated replication (BRR). Nonetheless, complex survey designs are often shoehorned into jackknife or BRR variance estimation by pretending that the actual survey design was something simpler. The BRR method, for instance, is only applicable to samples with two clusters sampled from each stratum, but statisticians frequently use it for designs with three or more sampled clusters per stratum by grouping the actual clusters into two pseudo-clusters. For designs with a large number of sampling units in a stratum, the exact jackknife (JK1 or JKn) requires a large number of replicates and so is often replaced by the “delete-a-group” jackknife (DAGJK), which randomly groups clusters into larger pseudo-clusters.
Why do statisticians go to all this effort to shoehorn their variance estimation problem into jackknife or BRR methods when they could just use the bootstrap?
The simple answer is that bootstrap methods generally require many more replicates than other methods in order to obtain a stable variance estimate. And using a large number of replicates can be a problem if you have to do a large amount of computing or if your dataset is large and you’re concerned about storage costs. Statistical agencies are particularly sensitive to these concerns when they publish microdata, since agencies often serve a large number of end-users with varying computational resources.
So why use the bootstrap?
The bootstrap tends to work well for a larger class of statistics than the jackknife. For example, when estimating the sampling variance of an estimated median or other quantile, the jackknife tends to perform poorly, while bootstrap methods at least do an adequate job.
Bootstrap methods enable different options for forming confidence intervals. For all of the standard replication methods (BRR, Jackknife, etc.), confidence intervals are generally formed using a Wald interval (\(\hat{\theta} \pm \hat{se}(\hat{\theta}) \times z_{1-\frac{\alpha}{2}}\)).1 But for certain bootstrap methods, it is possible to also form confidence intervals using other approaches, such as the bootstrap percentile method.
You can analyze your design rather than an approximation of your design, which can both reduce costs and better control errors. To use BRR for general survey designs, you have to approximate your actual survey design with a “two PSUs per stratum” design. This works surprisingly well in many cases, but it requires careful work on the part of a specially-trained statistician. For the jackknife with a large number of sampling units, you either end up with the same number of replicates as a bootstrap method, or you have to randomly group your sampling units into a smaller number of pseudo-clusters so that the DAGJK method can be used, which essentially approximates your actual survey design with a simpler one. Again, this takes careful work on the part of a specially-trained statistician.
By analyzing your design for what it is, you don’t have to pay a specially-trained statistician to meticulously approximate a design so that it can be shoehorned into a jackknife or BRR variance estimation problem, which is perhaps not the best use of a limited budget.
If the variance estimation is based on a bootstrap method tailored to the actual survey design, then the replication error in the variance estimates for key statistics is unbiased and can be quantified and controlled as a function of the number of replicates. In contrast, when variance estimation is based on approximating a design so that it can be shoehorned into a jackknife or BRR variance estimation problem, then the replication error in the variance estimates is more difficult to quantify and can consist of both noise and bias.
For most statisticians, it’s probably easier to learn. The bootstrap is the most well-known replication method among general statisticians, to the point that it’s often taught in first-year undergraduate statistics courses. So the basic idea is already familiar even to statisticians with only passing familiarity with complex survey sampling. BRR, in contrast, takes specialized training to learn and entails prerequisite concepts such as Hadamard matrices, partial balancing, and so on. Outside of survey statistics, the jackknife tends to be much less used (and taught) compared to the bootstrap, due to its limitations with non-smooth statistics and the complexity required to make it work efficiently for large sample sizes.
Essentially every bootstrap method commonly used for surveys can be used for simple random sampling with replacement and can easily be applied to stratified sampling (simply repeat the method separately for each stratum). However, things become more complicated for other types of sampling and care is needed to use a bootstrap method appropriate to the survey design. For many common designs used in practice, it is possible (and easy!) to use one of the bootstrap methods described in the section of this vignette titled “Basic Bootstrap Methods.” If the design isn’t appropriate for one of those basic bootstrap methods, then it may be possible to use the generalized survey bootstrap described in a later section of this vignette, which covers a wide range of designs. The interested reader is encouraged to read Mashreghi, Haziza, and Léger (2016) for an overview of bootstrap methods developed for survey samples.
For most sample designs used in practice, there are three basic survey design features that must be considered when choosing a bootstrap method:
Whether there are multiple stages of sampling
Whether the design uses without-replacement sampling with large sampling fractions
Whether the design uses unequal-probability sampling (commonly referred to as “probability proportional to size (PPS)” sampling in statistics jargon)
The ‘svrep’ and ‘survey’ packages implement four basic bootstrap methods, each of which can handle one or more of these survey design features. Of the four methods, the Rao-Wu-Yue-Beaumont bootstrap method (Beaumont and Émond 2022) is the only one able to directly handle all three of these design features and is thus the default method used in the function as_bootstrap_design().2 The following table summarizes these four basic bootstrap methods and their appropriateness for each of the common design features described earlier.
Method | Handles Multistage Samples? | Appropriate for Without-Replacement Sampling with Large Sample Fractions? | Unequal Probability Sampling (PPS)?
---|---|---|---
Rao-Wu-Yue-Beaumont | Yes | Yes | Yes
Rao-Wu | Only if first-stage sampling is with replacement or has a small sampling fraction | No | No
Preston | Yes | Yes | No
Canty-Davison | No (only applicable to single-stage designs) | Yes | No
The next table summarizes the survey design information required to implement each method:

Method | Strata IDs | Cluster IDs | Final Sampling Weights | Finite Population Corrections | Selection Probabilities by Stage
---|---|---|---|---|---
Rao-Wu-Yue-Beaumont | Yes | Yes | Yes | Yes, if sampling is without replacement | Yes, if the design has unequal probability sampling (PPS)
Rao-Wu | Yes (first stage only) | Yes (first stage only) | Yes | No | No
Preston | Yes | Yes | Yes | Yes, if sampling is without replacement | No
Canty-Davison | Yes (first stage only) | Yes (first stage only) | Yes | Yes (first stage only), if sampling is without replacement | No
To implement these basic bootstrap methods, we can create a survey design object with the svydesign() function from the survey package, and then convert this object to a bootstrap replicate design using as_bootstrap_design(). This approach can be used for multistage, stratified designs with different kinds of sampling at each stage, provided the “Rao-Wu-Yue-Beaumont” method is used.
Example 1: Multistage Simple Random Sampling without Replacement (SRSWOR)
library(survey) # For complex survey analysis
library(svrep)
set.seed(2022)
# Load an example dataset from a multistage sample, with two stages of SRSWOR
data("mu284", package = 'survey')
multistage_srswor_design <- svydesign(data = mu284,
                                      ids = ~ id1 + id2,
                                      fpc = ~ n1 + n2)

bootstrap_rep_design <- as_bootstrap_design(multistage_srswor_design,
                                            type = "Rao-Wu-Yue-Beaumont",
                                            replicates = 500)
svytotal(x = ~ y1, design = multistage_srswor_design)
#> total SE
#> y1 15080 2274.3
svytotal(x = ~ y1, design = bootstrap_rep_design)
#> total SE
#> y1 15080 2311.1
Example 2: Single-stage unequal probability sampling without replacement
# Load example dataset of U.S. counties and states with 2004 Presidential vote counts
data("election", package = 'survey')
pps_wor_design <- svydesign(data = election_pps,
                            pps = HR(),
                            fpc = ~ p, # Inclusion probabilities
                            ids = ~ 1)

bootstrap_rep_design <- as_bootstrap_design(pps_wor_design,
                                            type = "Rao-Wu-Yue-Beaumont",
                                            replicates = 100)
svytotal(x = ~ Bush + Kerry, design = pps_wor_design)
#> total SE
#> Bush 64518472 2472753
#> Kerry 51202102 2426842
svytotal(x = ~ Bush + Kerry, design = bootstrap_rep_design)
#> total SE
#> Bush 64518472 2331148
#> Kerry 51202102 2337653
Example 3: Multistage Sampling with Different Sampling Methods at Each Stage
For designs that use different sampling methods at different stages, we can use the argument samp_method_by_stage to ensure that the correct method is used to form the bootstrap weights. In general, if a multistage design uses unequal probability sampling for only some stages, then when creating the initial design object, the stage-specific sampling probabilities should be supplied to the fpc argument of the svydesign() function, and the user should specify pps = "brewer".
# Declare a multistage design,
# where first-stage sampling is PPSWOR
# and second-stage sampling is SRSWOR
multistage_design <- svydesign(
  data = library_multistage_sample,
  ids = ~ PSU_ID + SSU_ID,
  probs = ~ PSU_SAMPLING_PROB + SSU_SAMPLING_PROB,
  pps = "brewer"
)

# Convert to a bootstrap replicate design
boot_design <- as_bootstrap_design(
  design = multistage_design,
  type = "Rao-Wu-Yue-Beaumont",
  samp_method_by_stage = c("PPSWOR", "SRSWOR"),
  replicates = 1000
)
# Compare variance estimates
svytotal(x = ~ TOTCIR, na.rm = TRUE, design = multistage_design)
#> total SE
#> TOTCIR 1634739229 250890030
svytotal(x = ~ TOTCIR, na.rm = TRUE, design = boot_design)
#> total SE
#> TOTCIR 1634739229 259570779
For sample designs with additional complex features beyond the three highlighted above, the generalized survey bootstrap method can be used. This is especially useful for systematic samples or for complex designs where one wishes to use a general-purpose estimator such as the Horvitz-Thompson or Yates-Grundy estimator.
The generalized survey bootstrap is based on a remarkable observation from Fay (1984), summarized nicely by Dippo, Fay, and Morganstein (1984):
…there is no variance estimator based on sums of squares and cross-products that cannot be represented by a resampling plan.
-- Dippo, Fay, and Morganstein (1984)
In other words, if a sample design has a textbook variance estimator for totals that can be represented as a quadratic form (i.e., sums of squares and cross-products), then we can make a replication estimator out of it. Fay developed a general methodology for producing replication estimators from a textbook estimator’s quadratic form, encompassing the jackknife, bootstrap, and balanced repeated replication as special cases. Within this framework, the “generalized survey bootstrap” developed by Bertail and Combris (1997) is one specific strategy for making bootstrap replication estimators out of textbook variance estimators. See Beaumont and Patak (2012) for a clear overview of the generalized survey bootstrap.
The starting point for implementing the generalized survey bootstrap method is to choose a textbook variance estimator appropriate to the sampling design which can be represented as a quadratic form. Luckily, there are many useful variance estimators which can be represented as quadratic forms. We highlight a few prominent examples below:
- For stratified, multistage cluster samples: the “Stratified Multistage SRS” estimator, used in a later example in this vignette.
- Highly-general variance estimators which work for any ‘measurable’ survey design (i.e., designs where every pair of units in the population has a nonzero probability of appearing in the sample). This covers most designs used in practice, with the primary exceptions being “one-PSU-per-stratum” designs and systematic sampling designs:
  - The Horvitz-Thompson estimator
  - The Sen-Yates-Grundy estimator
- For systematic samples: successive-difference estimators such as “SD2” (Ash 2014), used in the examples below.
Once the textbook variance estimator has been selected and its quadratic form identified, the generalized survey bootstrap method consists of randomly generating each set of replicate weights from a multivariate distribution whose expectation is the \(n\)-vector \(\mathbf{1}_n\) and whose variance-covariance matrix is the matrix of the quadratic form used for the textbook variance estimator. This ensures that, in expectation, the bootstrap variance estimator for a total equals the textbook variance estimator and thus inherits properties such as design-unbiasedness and design-consistency.
In this section, we describe the generalized survey bootstrap in greater detail, using the notation of Beaumont and Patak (2012).
Let \(v(\hat{T}_y)\) be the textbook variance estimator for an estimated population total \(\hat{T}_y\) of some variable \(y\). The base weight for case \(i\) in our sample is \(w_i\), and we let \(\breve{y}_i\) denote the weighted value \(w_iy_i\).
Suppose we can represent our textbook variance estimator as a quadratic form: \(v(\hat{T}_y) = \mathbf{\breve{y}}^{\prime}\boldsymbol{\Sigma}\mathbf{\breve{y}}\), for some \(n \times n\) matrix \(\boldsymbol{\Sigma}\). Our only constraint on \(\boldsymbol{\Sigma}\) is that, for our sample, it must be symmetric and positive semi-definite (in other words, it should never lead to a negative variance estimate, no matter what the value of \(\mathbf{\breve{y}}\) is).
For example, the popular Horvitz-Thompson estimator based on first-order inclusion probabilities \(\pi_k\) and second-order inclusion probabilities \(\pi_{kl}\) can be represented as a positive semi-definite matrix with entries \((1-\pi_k)\) along the main diagonal and entries \((1 - \frac{\pi_k \pi_l}{\pi_{kl}})\) everywhere else. An illustration for a sample with \(n=3\) is shown below:
\[ \Sigma_{HT} = \begin{bmatrix} (1-\pi_1) & (1 - \frac{\pi_1 \pi_2}{\pi_{12}}) & (1 - \frac{\pi_1 \pi_3}{\pi_{13}}) \\ (1 - \frac{\pi_2 \pi_1}{\pi_{21}}) & (1 - \pi_2) & (1 - \frac{\pi_2 \pi_3}{\pi_{23}}) \\ (1 - \frac{\pi_3 \pi_1}{\pi_{31}}) & (1 - \frac{\pi_3 \pi_2}{\pi_{32}}) & (1 - \pi_3) \end{bmatrix} \]
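As a concrete numeric sketch (the inclusion probabilities and \(y\)-values below are hypothetical, not taken from any dataset in this vignette), we can construct \(\Sigma_{HT}\) in base R and use it to compute a variance estimate for a total:

```r
# Hypothetical first-order inclusion probabilities for a sample of n = 3
pik <- c(0.2, 0.5, 0.4)

# Hypothetical (symmetric) second-order inclusion probabilities,
# with pi_kk = pi_k on the diagonal
pikl <- matrix(c(0.20, 0.08, 0.07,
                 0.08, 0.50, 0.18,
                 0.07, 0.18, 0.40),
               nrow = 3, byrow = TRUE)

# Quadratic form matrix of the Horvitz-Thompson variance estimator:
# off-diagonal entries are (1 - pi_k * pi_l / pi_kl); since pi_kk = pi_k,
# the same formula yields (1 - pi_k) on the diagonal
Sigma_HT <- 1 - outer(pik, pik) / pikl

# The variance estimate for a total is a quadratic form in the weighted values
y <- c(10, 25, 18)
wtd_y <- y / pik
v_HT <- as.numeric(t(wtd_y) %*% Sigma_HT %*% wtd_y)
```

Note that the matrix is symmetric because the hypothetical joint probabilities are symmetric; whether it is positive semi-definite depends on the particular design.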
As another example, the successive-difference variance estimator for a systematic sample can be represented as a positive semi-definite matrix whose diagonal entries are all \(1\), whose superdiagonal and subdiagonal entries are all \(-1/2\), and whose top right and bottom left entries are \(-1/2\) (Ash 2014). An illustration for a sample with \(n=5\) is shown below:
\[ \Sigma_{SD2} = \begin{bmatrix} 1 & -1/2 & 0 & 0 & -1/2\\ -1/2 & 1 & -1/2 & 0 & 0 \\ 0 & -1/2 & 1 & -1/2 & 0 \\ 0 & 0 & -1/2 & 1& -1/2 \\ -1/2 & 0 & 0 & -1/2 & 1 \end{bmatrix} \]
To obtain a quadratic form matrix for a variance estimator, we can use the function make_quad_form_matrix(), which takes as inputs the name of a variance estimator and relevant survey design information. For example, the following code produces the quadratic form matrix of the “SD2” variance estimator we saw earlier.
make_quad_form_matrix(
  variance_estimator = "SD2",
  cluster_ids = c(1,2,3,4,5) |> data.frame(),
  strata_ids = c(1,1,1,1,1) |> data.frame(),
  sort_order = c(1,2,3,4,5)
)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  1.0 -0.5  0.0  0.0 -0.5
#> [2,] -0.5  1.0 -0.5  0.0  0.0
#> [3,]  0.0 -0.5  1.0 -0.5  0.0
#> [4,]  0.0  0.0 -0.5  1.0 -0.5
#> [5,] -0.5  0.0  0.0 -0.5  1.0
In the following example, we use this method to estimate the variance for a stratified systematic sample of U.S. public libraries. First, we create a quadratic form matrix to represent the SD2 successive-difference estimator.
# Load an example dataset of a stratified systematic sample
data('library_stsys_sample', package = 'svrep')
library(dplyr) # For select() and pull()

# Represent the SD2 successive-difference estimator as a quadratic form,
# and obtain the matrix of that quadratic form
sd2_quad_form <- make_quad_form_matrix(
  variance_estimator = 'SD2',
  cluster_ids = library_stsys_sample |> select(FSCSKEY),
  strata_ids = library_stsys_sample |> select(SAMPLING_STRATUM),
  strata_pop_sizes = library_stsys_sample |> select(STRATUM_POP_SIZE),
  sort_order = library_stsys_sample |> pull("SAMPLING_SORT_ORDER")
)
class(sd2_quad_form)
#> [1] "matrix" "array"
dim(sd2_quad_form)
#> [1] 219 219
Next, we estimate the sampling variance of the estimated total for the TOTCIR variable using the quadratic form.
# Obtain weighted values
wtd_y <- as.matrix(library_stsys_sample[['TOTCIR']] /
                     library_stsys_sample[['SAMPLING_PROB']])
wtd_y[is.na(wtd_y)] <- 0

# Obtain point estimate for a population total
point_estimate <- sum(wtd_y)

# Obtain the variance estimate using the quadratic form
variance_estimate <- t(wtd_y) %*% sd2_quad_form %*% wtd_y
# Summarize results
sprintf("Estimate: %s", point_estimate)
#> [1] "Estimate: 2195846830.39647"
sprintf("Standard Error: %s", sqrt(variance_estimate[1,1]))
#> [1] "Standard Error: 564706867.832222"
sprintf("Coefficient of Variation: %s",
round(sqrt(variance_estimate[1,1])/point_estimate, 2))
#> [1] "Coefficient of Variation: 0.26"
Our goal is to form \(B\) sets of bootstrap weights, where the \(b\)-th set of bootstrap weights is a vector of length \(n\) denoted \(\mathbf{a}^{(b)}\), whose \(k\)-th value is denoted \(a_k^{(b)}\). This gives us \(B\) replicate estimates of the population total, \(\hat{T}_y^{*(b)}=\sum_{k \in s} a_k^{(b)} \breve{y}_k\), for \(b=1, \ldots B\), from which we can easily calculate an estimate of the sampling variance.
\[ v_B\left(\hat{T}_y\right)=\frac{\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2}{B} \]
We can write this bootstrap variance estimator as a quadratic form:
\[ \begin{aligned} v_B\left(\hat{T}_y\right) &=\mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}} \\ \textit{where}& \\ \boldsymbol{\Sigma}_B &= \frac{\sum_{b=1}^B\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)^{\prime}}{B} \end{aligned} \]
Note that if the vector of adjustment factors \(\mathbf{a}^{(b)}\) has expectation \(\mathbf{1}_n\) and variance-covariance matrix \(\boldsymbol{\Sigma}\), then we have the bootstrap expectation \(E_{*}\left( \boldsymbol{\Sigma}_B \right) = \boldsymbol{\Sigma}\). Since the bootstrap process takes the sample values \(\breve{y}\) as fixed, the bootstrap expectation of the variance estimator is \(E_{*} \left( \mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}}\right)= \mathbf{\breve{y}}^{\prime}\Sigma \mathbf{\breve{y}}\). Thus, we can produce a bootstrap variance estimator with the same expectation as the textbook variance estimator simply by randomly generating \(\mathbf{a}^{(b)}\) from a distribution with the following two conditions:
Condition 1: \(\quad \mathbf{E}_*(\mathbf{a})=\mathbf{1}_n\)
Condition 2: \(\quad \mathbf{E}_*\left(\mathbf{a}-\mathbf{1}_n\right)\left(\mathbf{a}-\mathbf{1}_n\right)^{\prime}=\mathbf{\Sigma}\)
The simplest general way to generate the adjustment factors is to simulate from a multivariate normal distribution, \(\mathbf{a} \sim MVN(\mathbf{1}_n, \boldsymbol{\Sigma})\), which is the method used in this package. However, this method can lead to negative adjustment factors and hence negative bootstrap weights, which (while perfectly valid for variance estimation) may be undesirable from a practical point of view. Thus, in the following subsection, we describe one method for adjusting the replicate factors so that they are nonnegative and \(E_{*} \left( \mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}}\right) =\mathbf{\breve{y}}^{\prime}\Sigma \mathbf{\breve{y}}\).
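A minimal sketch of this generation step, using the \(n=5\) “SD2” matrix shown earlier as the target \(\boldsymbol{\Sigma}\) and MASS::mvrnorm() for the multivariate normal draws (the empirical check at the end is our own illustration, not part of the package):

```r
library(MASS) # For mvrnorm()

set.seed(123)

# The SD2 quadratic form matrix for n = 5, as shown earlier:
# 1 on the diagonal, -1/2 on the sub/superdiagonals and the two corners
Sigma <- matrix(0, nrow = 5, ncol = 5)
diag(Sigma) <- 1
Sigma[abs(row(Sigma) - col(Sigma)) == 1] <- -1/2
Sigma[1, 5] <- Sigma[5, 1] <- -1/2

# Draw B sets of adjustment factors with mean 1_n and covariance Sigma
B <- 5000
A <- t(mvrnorm(n = B, mu = rep(1, 5), Sigma = Sigma)) # an n x B matrix

# Empirically, Sigma_B = (1/B) * sum_b (a_b - 1)(a_b - 1)' approximates Sigma,
# so the bootstrap variance estimator matches the textbook one in expectation
Sigma_B <- tcrossprod(A - 1) / B
max(abs(Sigma_B - Sigma)) # small, and shrinks as B grows
```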
Let \(\mathbf{A} = \left[ \mathbf{a}^{(1)} \cdots \mathbf{a}^{(b)} \cdots \mathbf{a}^{(B)} \right]\) denote the \((n \times B)\) matrix of bootstrap adjustment factors. To eliminate negative adjustment factors, Beaumont and Patak (2012) propose forming a rescaled matrix of nonnegative replicate factors \(\mathbf{A}^S\) by rescaling each adjustment factor \(a_k^{(b)}\) as follows:
\[ \begin{aligned} a_k^{S,(b)} &= \frac{a_k^{(b)} + \tau - 1}{\tau} \\ \textit{where } \tau &\geq \max \left( 1, \; 1 - a_k^{(b)} \right) \\ &\textit{for all }k \textit{ in } \left\{ 1,\ldots,n \right\} \\ &\textit{and all }b \textit{ in } \left\{1, \ldots, B\right\} \\ \end{aligned} \]
The value of \(\tau\) can be set based on the realized adjustment factor matrix \(\mathbf{A}\) or by choosing \(\tau\) prior to generating the adjustment factor matrix \(\mathbf{A}\) so that \(\tau\) is likely to be large enough to prevent negative bootstrap weights.
If the adjustment factors are rescaled in this manner, it is important to adjust the scale factor used in estimating the variance with the bootstrap replicates, which becomes \(\frac{\tau^2}{B}\) instead of \(\frac{1}{B}\).
\[ \begin{aligned} \textbf{Prior to rescaling: } v_B\left(\hat{T}_y\right) &= \frac{1}{B}\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2 \\ \textbf{After rescaling: } v_B\left(\hat{T}_y\right) &= \frac{\tau^2}{B}\sum_{b=1}^B\left(\hat{T}_y^{S*(b)}-\hat{T}_y\right)^2 \\ \end{aligned} \]
When sharing a dataset that uses rescaled weights from a generalized survey bootstrap, the documentation for the dataset should instruct the user to use replication scale factor \(\frac{\tau^2}{B}\) rather than \(\frac{1}{B}\) when estimating sampling variances.
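As a check on this bookkeeping, the following toy sketch (with made-up replicate estimates, not from any dataset in this vignette) confirms that rescaling by \(\tau\) and switching the scale factor to \(\tau^2/B\) reproduces the original variance estimate:

```r
# Toy numbers: B = 4 replicate estimates of a total, and the full-sample estimate
Tb <- c(95, 110, 102, 88) # bootstrap replicate estimates of the total
Ty <- 100                 # full-sample estimate of the total
B  <- length(Tb)

# Variance estimate before rescaling, with scale factor 1/B
v_before <- sum((Tb - Ty)^2) / B

# Rescaling each adjustment factor as (a + tau - 1) / tau shrinks each
# replicate estimate toward the full-sample estimate by the factor 1/tau
tau <- 3
Tb_rescaled <- (Tb + (tau - 1) * Ty) / tau

# Variance estimate after rescaling, with scale factor tau^2 / B
v_after <- (tau^2 / B) * sum((Tb_rescaled - Ty)^2)

all.equal(v_before, v_after) # TRUE: tau^2/B exactly undoes the shrinkage
```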
There are two ways to implement the generalized survey bootstrap.
The simplest method is to convert an existing survey design object to a generalized bootstrap design.
With this approach, we create a survey design object using the svydesign() function. This allows us to represent information about stratification and clustering (potentially at multiple stages), as well as information about finite population corrections.
# Load example data from stratified systematic sample
data('library_stsys_sample', package = 'svrep')
# First, ensure data are sorted in same order as was used in sampling
library_stsys_sample <- library_stsys_sample[
  order(library_stsys_sample$SAMPLING_SORT_ORDER),
]

# Create a survey design object
design_obj <- svydesign(
  data = library_stsys_sample,
  strata = ~ SAMPLING_STRATUM,
  ids = ~ 1,
  fpc = ~ STRATUM_POP_SIZE
)
Next, we convert the survey design object to a replicate design using the function as_gen_boot_design(). The function argument variance_estimator allows us to specify the name of a variance estimator to use as the basis for creating replicate weights.
# Convert to generalized bootstrap replicate design
gen_boot_design_sd2 <- as_gen_boot_design(
  design = design_obj,
  variance_estimator = "SD2",
  replicates = 2000
)
#> For `variance_estimator='SD2'`, assumes rows of data are sorted in the same order used in sampling.
# Estimate sampling variances
svymean(x = ~ TOTSTAFF, na.rm = TRUE, design = gen_boot_design_sd2)
#> mean SE
#> TOTSTAFF 19.756 4.2644
For a PPS design that uses the Horvitz-Thompson or Yates-Grundy estimator, we can create a generalized bootstrap estimator with the same expectation. In the example below, we create a PPS design with the ‘survey’ package and convert it to a generalized bootstrap design.
# Load example data of a PPS survey of counties and states
data('election', package = 'survey')
# Create survey design object
pps_design_ht <- svydesign(
  data = election_pps,
  id = ~1, fpc = ~p,
  pps = ppsmat(election_jointprob),
  variance = "HT"
)

# Convert to generalized bootstrap replicate design
gen_boot_design_ht <- pps_design_ht |>
  as_gen_boot_design(variance_estimator = "Horvitz-Thompson",
                     replicates = 5000, tau = "auto")
# Compare sampling variances from bootstrap vs. Horvitz-Thompson estimator
svytotal(x = ~ Bush + Kerry, design = pps_design_ht)
#> total SE
#> Bush 64518472 2604404
#> Kerry 51202102 2523712
svytotal(x = ~ Bush + Kerry, design = gen_boot_design_ht)
#> total SE
#> Bush 64518472 2627361
#> Kerry 51202102 2499754
We can also use the generalized bootstrap for designs that use multistage, stratified simple random sampling without replacement.
library(dplyr) # For data manipulation

# Create a multistage survey design
multistage_design <- svydesign(
  data = library_multistage_sample |>
    mutate(Weight = 1/SAMPLING_PROB),
  ids = ~ PSU_ID + SSU_ID,
  fpc = ~ PSU_POP_SIZE + SSU_POP_SIZE,
  weights = ~ Weight
)

# Convert to a generalized bootstrap design
multistage_boot_design <- as_gen_boot_design(
  design = multistage_design,
  variance_estimator = "Stratified Multistage SRS"
)
# Compare variance estimates
svytotal(x = ~ TOTCIR, na.rm = TRUE, design = multistage_design)
#> total SE
#> TOTCIR 1634739229 251589313
svytotal(x = ~ TOTCIR, na.rm = TRUE, design = multistage_boot_design)
#> total SE
#> TOTCIR 1634739229 236364980
Unless specified otherwise, as_gen_boot_design() automatically selects a rescaling value \(\tau\) to use for eliminating negative adjustment factors. The scale attribute of the resulting replicate survey design object is thus set to equal \(\tau^2/B\). The specific value of \(\tau\) can be retrieved from the replicate design object, as follows.
# View overall scale factor
overall_scale_factor <- gen_boot_design_ht$scale
print(overall_scale_factor)
#> [1] 0.00464648

# Check that the scale factor was calculated correctly
tau <- gen_boot_design_ht$tau
print(tau)
#> [1] 4.82

B <- ncol(gen_boot_design_ht$repweights)
print(B)
#> [1] 5000

print( (tau^2) / B )
#> [1] 0.00464648
Alternatively, the generalized survey bootstrap can be implemented with a two-step process:

Step 1: Use make_quad_form_matrix() to represent the variance estimator as a quadratic form’s matrix.

# Load an example dataset of a stratified systematic sample
data('library_stsys_sample', package = 'svrep')

# Represent the SD2 successive-difference estimator as a quadratic form,
# and obtain the matrix of that quadratic form
sd2_quad_form <- make_quad_form_matrix(
  variance_estimator = 'SD2',
  cluster_ids = library_stsys_sample |> select(FSCSKEY),
  strata_ids = library_stsys_sample |> select(SAMPLING_STRATUM),
  strata_pop_sizes = library_stsys_sample |> select(STRATUM_POP_SIZE),
  sort_order = library_stsys_sample |> pull("SAMPLING_SORT_ORDER")
)
Step 2: Use make_gen_boot_factors() to generate replicate factors based on the target quadratic form. The function argument tau can be used to avoid negative adjustment factors using the previously-described method. The actual value of tau used can be extracted from the function’s output using the attr() function.

rep_adj_factors <- make_gen_boot_factors(
  Sigma = sd2_quad_form,
  num_replicates = 500,
  tau = "auto"
)

tau <- attr(rep_adj_factors, 'tau')
B   <- ncol(rep_adj_factors)
Using the adjustment factors thus created, we can create a replicate survey design object by using the function svrepdesign(), with the argument type = "other" and with the scale argument set to the factor \(\tau^2/B\).
gen_boot_design <- svrepdesign(
  data = library_stsys_sample |>
    mutate(SAMPLING_WEIGHT = 1/SAMPLING_PROB),
  repweights = rep_adj_factors,
  weights = ~ SAMPLING_WEIGHT,
  combined.weights = FALSE,
  type = "other",
  scale = tau^2/B,
  rscales = rep(1, times = B)
)
This allows us to estimate sampling variances, even for quite complex sampling designs.
gen_boot_design |>
  svymean(x = ~ TOTSTAFF,
          na.rm = TRUE, deff = TRUE)
#>             mean     SE DEff
#> TOTSTAFF 19.7565 4.2024 0.97
The bootstrap suffers from unavoidable “simulation error” (also referred to as “Monte Carlo” error) caused by using a finite number of replicates to simulate the ideal bootstrap estimate we would obtain if we used an infinite number of replicates. In general, the simulation error can be reduced by using a larger number of bootstrap replicates.
While there are many rule-of-thumb values for the number of replicates that should be used (some say 500, others say 1,000), it is advisable to instead use a principled strategy for choosing the number of replicates. One general strategy proposed by Beaumont and Patak (2012) is as follows:
Step 1: Determine the largest acceptable level of simulation error for key survey estimates. For example, one might determine that, on average, the bootstrap standard error estimate should be no more than \(\pm 5\%\) different than the ideal bootstrap estimate.
Step 2: Estimate key statistics of interest using a large number of bootstrap replicates (such as 5,000) and save the estimates from each bootstrap replicate. This can be conveniently done using a function from the ‘survey’ package such as svymean(..., return.replicates = TRUE) or withReplicates(..., return.replicates = TRUE).

Step 3: Estimate the minimum number of bootstrap replicates needed to reduce the level of simulation error to the target level. This can be done using the ‘svrep’ function estimate_boot_reps_for_target_cv().
Simulation error can be measured as a “simulation coefficient of variation” (CV), which is the ratio of the standard error of a bootstrap estimator to the expectation of that bootstrap estimator, where the expectation and standard error are evaluated with respect to the bootstrapping process given the selected sample.
For a statistic \(\hat{\theta}\), the simulation CV of the bootstrap variance estimator \(v_{B}(\hat{\theta})\) based on \(B\) replicate estimates \(\hat{\theta}^{\star}_1,\dots,\hat{\theta}^{\star}_B\) is defined as follows:
\[\begin{aligned} CV_{\star}(v_{B}(\hat{\theta})) &= \frac{\sqrt{var_{\star}(v_B(\hat{\theta}))}}{E_{\star}(v_B(\hat{\theta}))} = \frac{CV_{\star}(E_2)}{\sqrt{B}} \\ \textit{where:}& \\ E_2 &= (\hat{\theta}^{\star} - \hat{\theta})^2 \\ CV_{\star}(E_2) &= \frac{\sqrt{var_{\star}(E_2)}}{E_{\star}(E_2)} \\ \textit{and }& var_{\star} \text{ and } E_{\star} \textit{ are evaluated} \\ & \textit{with respect to the bootstrapping process} \\ & \textit{given the selected sample} \end{aligned}\]

The simulation CV, \(CV_{\star}(v_{B}(\hat{\theta}))\), can thus be estimated for a given number of replicates \(B\) by estimating \(CV_{\star}(E_2)\) from the observed replicate estimates and dividing by \(\sqrt{B}\). One can thereby estimate the number of bootstrap replicates needed to obtain a target simulation CV, which is a useful strategy for determining the number of bootstrap replicates to use for a survey.
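This calculation can be sketched directly from a vector of replicate estimates (the replicate values below are simulated toy numbers; for real designs, estimate_boot_sim_cv() and estimate_boot_reps_for_target_cv() do this work):

```r
set.seed(1)

# Toy replicate estimates: pretend we ran B = 2000 bootstrap replicates
theta_hat  <- 50
theta_star <- rnorm(n = 2000, mean = theta_hat, sd = 4)
B <- length(theta_star)

# E2: squared deviation of each replicate estimate from the full-sample estimate
E2 <- (theta_star - theta_hat)^2

# Estimate CV_star(E2) from the replicates, then divide by sqrt(B)
cv_E2  <- sd(E2) / mean(E2)
sim_cv <- cv_E2 / sqrt(B)

# Smallest B for which cv_E2 / sqrt(B) is at or below a target CV (e.g., 1%)
target_cv <- 0.01
B_needed  <- ceiling((cv_E2 / target_cv)^2)
```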
With the ‘svrep’ package, it is possible to estimate the number of bootstrap replicates required to obtain a target simulation CV for a statistic.
library(survey)
data('api', package = 'survey')
# Declare a bootstrap survey design object ----
boot_design <- svydesign(
  data = apistrat,
  weights = ~pw,
  id = ~1,
  strata = ~stype,
  fpc = ~fpc
) |>
  svrep::as_bootstrap_design(replicates = 5000)
# Produce estimates of interest, and save the estimate from each replicate ----
estimated_means_and_proportions <- svymean(x = ~ api00 + api99 + stype,
                                           design = boot_design,
                                           return.replicates = TRUE)

# Estimate the number of replicates needed to obtain a target simulation CV ----
estimate_boot_reps_for_target_cv(
  svrepstat = estimated_means_and_proportions,
  target_cv = c(0.01, 0.05, 0.10)
)
#>   TARGET_CV MAX_REPS api00 api99 stypeE stypeH stypeM
#> 1      0.01    15068  6651  6652  15068  15068  15068
#> 2      0.05      603   267   267    603    603    603
#> 3      0.10      151    67    67    151    151    151
To estimate the simulation CV for the number of replicates actually used, we can use the function estimate_boot_sim_cv().
estimate_boot_sim_cv(estimated_means_and_proportions)
#> STATISTIC SIMULATION_CV N_REPLICATES
#> 1 api00 0.01153322 5000
#> 2 api99 0.01153350 5000
#> 3 stypeE 0.01735956 5000
#> 4 stypeH 0.01735950 5000
#> 5 stypeM 0.01735951 5000
Non-Wald methods are typically only used in practice for confidence intervals for proportions. These non-Wald methods are normally based on the point estimate, the standard error, and a measure of effective sample size.↩︎
Preston’s method can handle both multistage samples and without-replacement sampling with large sample fractions, but it is not strictly applicable to designs with unequal probability sampling. The Rao-Wu method is a special case of the Rao-Wu-Yue-Beaumont method and only applies to designs where the first stage of sampling is with replacement or where the first-stage sample fraction is small enough to be ignored. The Canty-Davison method is only applicable to single-stage designs with simple random sampling (with or without replacement) within strata.↩︎