R: Draw bootstrap replicates

draw.bootstrap {surveysd}

R Documentation

Draw bootstrap replicates

Description

Draw bootstrap replicates from survey data with rotating panel design. Survey information, like ID, sample weights, strata and population totals per stratum, should be specified to ensure meaningful survey bootstrapping.

Usage

draw.bootstrap(
  dat,
  REP = 1000,
  hid = NULL,
  weights,
  period = NULL,
  strata = NULL,
  cluster = NULL,
  totals = NULL,
  single.PSU = c("merge", "mean"),
  boot.names = NULL,
  split = FALSE,
  pid = NULL,
  new.method = FALSE
)

Arguments

`dat`	either data.frame or data.table containing the survey data with rotating panel design.
`REP`	integer indicating the number of bootstrap replicates.
`hid`	character specifying the name of the column in `dat` containing the household id. If `NULL` (the default), the household structure is not regarded.
`weights`	character specifying the name of the column in `dat` containing the sample weights.
`period`	character specifying the name of the column in `dat` containing the sample periods. If `NULL` (the default), it is assumed that all observations belong to the same period.
`strata`	character vector specifying the name(s) of the column(s) in `dat` by which the population was stratified. If `strata` is a vector, stratification is assumed to be by the combination of the columns named in `strata`. If, in addition, `cluster` is not `NULL`, stratification is assumed on multiple stages, where each additional entry in `strata` specifies the stratification variable for the next lower stage. See Details for more information.
`cluster`	character vector specifying the cluster(s) in the data. If not already specified in `cluster`, the household ID is taken as the lowest-level cluster.
`totals`	character specifying the name of the column in `dat` containing the totals per stratum and/or cluster. It is optional ONLY if `cluster` is `NULL` or equal to `hid` and `strata` contains a single column name; in that case the number of households per stratum is calculated using the `weights` argument. If clusters and strata are specified for multiple stages, `totals` needs to be a vector of `length(strata)` naming the columns in `dat` that contain the total number of PSUs at each stage. `totals` is interpreted from left to right, meaning that the first element corresponds to the number of PSUs at the first stage and the last element to the number of PSUs at the last stage.
`single.PSU`	either "merge" or "mean", defining how single PSUs are dealt with. For `single.PSU="merge"`, single PSUs at each stage are merged with the stratum or cluster with the next least number of PSUs. If several of those exist, one is selected via random draw. For `single.PSU="mean"`, single PSUs get the mean over all bootstrap replicates at the stage which did not contain single PSUs.
`boot.names`	character indicating the leading string of the column names for each bootstrap replicate. If `NULL`, defaults to "w".
`split`	logical, if TRUE split households are considered using `pid`, for more information see Details.
`pid`	column in `dat` specifying the personal identifier. This identifier needs to be unique for each person throughout the whole data set.
`new.method`	logical, if TRUE bootstrap replicates will never be negative even if in some strata the whole population is in the sample. WARNING: This is still experimental and resulting standard errors might be underestimated! Use this if for some strata the whole population is in the sample!

Details

draw.bootstrap takes dat and draws REP bootstrap replicates from it. dat must be household data where household members correspond to multiple rows with the same household identifier. For most practical applications, the following columns should be available in the dataset and passed via the corresponding parameters:

Column indicating the sample period (parameter period).
Column indicating the household ID (parameter hid).
Column containing the household sample weights (parameter weights).
Columns by which the population was stratified during the sampling process (parameter strata).

For a single-stage sampling design the argument totals is optional, meaning that a column with the number of PSUs at the first stage does not need to be supplied. In this case the number of PSUs is calculated from strata and weights and added to dat. If cluster is NULL, a single-stage sampling design is always assumed, and if strata contains multiple column names, the combination of all those columns is used for stratification.
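
As a rough illustration of the single-stage case, the sketch below derives a per-stratum household count by summing the household weight once per household. The toy data and the column names (`hid`, `region`, `hhweight`, `n_households`) are hypothetical, and the calculation is only an assumption about what "calculated using strata and weights" might look like, not the package's internal code.

```r
# Toy person-level data: each household appears once per member.
dat <- data.frame(
  hid      = c(1, 1, 2, 3, 3, 3, 4),
  region   = c("A", "A", "A", "B", "B", "B", "B"),
  hhweight = c(10, 10, 20, 5, 5, 5, 15)
)

# Keep one row per household so each weight is counted only once.
hh <- dat[!duplicated(dat$hid), ]

# Estimated number of households (PSUs) per stratum = sum of weights.
totals <- aggregate(hhweight ~ region, data = hh, FUN = sum)
names(totals)[2] <- "n_households"

# Attach the per-stratum totals to every row of the original data.
dat <- merge(dat, totals, by = "region")
```

Summing over deduplicated households rather than over all rows matters here: household members share the same household weight, so summing over persons would overstate the number of households.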

In the case of a multi-stage sampling design the argument totals needs to be specified and needs to have the same number of elements as strata.

If cluster is NULL or does not contain hid at the last stage, hid will automatically be used as the final cluster. If, besides hid, clustering in additional stages is specified the number of column names in strata and cluster (including hid) must be the same. If for any stage there was no clustering or stratification one can set "1" or "I" for this stage.

For example strata=c("REGION","I"),cluster=c("MUNICIPALITY","HID") would specify a 2-stage sampling design where at the first stage the municipalities were drawn stratified by region and at the second stage households were drawn in each municipality without stratification.

Bootstrap replicates are drawn for each survey period (period) using the function rescaled.bootstrap. Afterwards the bootstrap replicates for each household are carried forward from the first period the household enters the survey to all the consecutive periods it stays in the survey.

This ensures that the bootstrap replicates follow the same logic as the sampled households, making the bootstrap replicates more comparable to the actual sample units.
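
The carry-forward idea can be sketched in base R. This is a simplified illustration under stated assumptions: the toy panel and the column names are hypothetical, and a plain 0/2 draw stands in for the rescaled bootstrap that the package actually uses.

```r
set.seed(1)

# Toy rotating panel: household 1 is observed 2010-2012, household 2
# in 2011-2012, household 3 only in 2012 (hypothetical data).
panel <- data.frame(
  hid    = c(1, 1, 1, 2, 2, 3),
  period = c(2010, 2011, 2012, 2011, 2012, 2012)
)

# First period in which each household enters the survey.
first_period <- aggregate(period ~ hid, data = panel, FUN = min)

# Draw one replicate value per household at entry. The 0/2 draw is a
# stand-in, NOT the rescaled bootstrap used by the package.
first_period$w1 <- sample(c(0, 2), nrow(first_period), replace = TRUE)

# Carry the value forward to all periods of the same household.
panel <- merge(panel, first_period[, c("hid", "w1")], by = "hid")

# Number of distinct replicate values per household across periods.
values_per_hh <- tapply(panel$w1, panel$hid, function(x) length(unique(x)))
values_per_hh
```

Because the value is drawn once at entry and then joined back by household, every household keeps a single replicate value for as long as it stays in the panel.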

If split is set to TRUE and pid is specified, the bootstrap replicates are carried forward using the personal identifiers instead of the household identifier. This takes into account the issue of a household splitting up. Any person in this new split household will get the same bootstrap replicate as the person that has come from another household in the survey. People who enter already existing households will also get the same bootstrap replicate as the other household members had in the previous periods.

Value

The survey data with REP bootstrap replicates added as columns.

Returns a data.table containing the original data as well as REP additional columns, one with the bootstrap replicates for each repetition.
By default the bootstrap replicate columns are labeled "wNumber", where Number goes from 1 to REP. If the column names of the bootstrap replicates should start with a different character or string, the parameter boot.names can be used.
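
The naming scheme can be reproduced in a few lines of base R, which is handy when the replicate columns need to be selected afterwards. This is a minimal sketch: `result` is a hypothetical stand-in for the returned data, and the fallback logic only mirrors the documented default rather than the package's own code.

```r
REP <- 3
boot.names <- NULL  # mirrors the documented default

# Fall back to the prefix "w" when no prefix is given.
prefix <- if (is.null(boot.names)) "w" else boot.names
replicate_cols <- paste0(prefix, seq_len(REP))
replicate_cols

# Pick the replicate columns out of a result by name pattern
# (`result` is a hypothetical stand-in for the returned data).
result <- data.frame(hid = 1:2, pWeight = c(10, 20),
                     w1 = c(0, 2), w2 = c(2, 0), w3 = c(2, 2))
found_cols <- grep(paste0("^", prefix, "[0-9]+$"), names(result),
                   value = TRUE)
found_cols
```

Anchoring the pattern with `^` and `$` avoids accidentally matching other columns that merely start with the same prefix, such as a column called `weights`.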

Author(s)

Johannes Gussenbauer, Alexander Kowarik, Statistics Austria

Examples

## Not run: 
library(surveysd)
eusilc <- demo.eusilc(prettyNames = TRUE)

## draw sample without stratification or clustering
dat_boot <- draw.bootstrap(eusilc, REP = 10, weights = "pWeight",
                           period = "year")

## use stratification w.r.t. region and clustering w.r.t. households
dat_boot <- draw.bootstrap(
  eusilc, REP = 10, hid = "hid", weights = "pWeight",
  strata = "region", period = "year")

## use stratification by the combination of region and age
## (cluster is NULL, so a single-stage design is assumed)
dat_boot <- draw.bootstrap(
  eusilc, REP = 10, hid = "hid", weights = "pWeight",
  strata = c("region", "age"), period = "year")


# create split households
eusilc[, pidsplit := pid]
year <- eusilc[, unique(year)]
year <- year[-1]
leaf_out <- c()
for(y in year) {
  split.person <- eusilc[
    year == (y-1) & !duplicated(hid) & !(hid %in% leaf_out),
    sample(pid, 20)
  ]
  overwrite.person <- eusilc[
    (year == (y)) & !duplicated(hid) & !(hid %in% leaf_out),
    .(pid = sample(pid, 20))
  ]
  overwrite.person[, c("pidsplit", "year_curr") := .(split.person, y)]

  eusilc[overwrite.person, pidsplit := i.pidsplit,
         on = .(pid, year >= year_curr)]
  leaf_out <- c(leaf_out,
                eusilc[pid %in% c(overwrite.person$pid,
                                  overwrite.person$pidsplit),
                unique(hid)])
}

dat_boot <- draw.bootstrap(
  eusilc, REP = 10, hid = "hid", weights = "pWeight",
  strata = c("region", "age"), period = "year", split = TRUE,
  pid = "pidsplit")
# split households were considered, i.e. a household and its
# split household were either both selected or both not selected
dat_boot[, data.table::uniqueN(w1), by = pidsplit][V1 > 1]

## End(Not run)

[Package surveysd version 1.3.1 Index]