grouped_resample {archetypal}R Documentation

Function for performing simple or Dirichlet resampling

Description

The function may be used for standard bootstrapping or for subsampling, see [1]. This function allows samples to be drawn with or without replacement, by groups and with or without Dirichlet weights, see [2]. This provides a variety of options for researchers who wish to correct sample biases, estimate empirical confidence intervals, and/or subsample large data sets.

Usage

grouped_resample(in_data = NULL, grp_vector = NULL, grp_matrix = NULL, 
                 replace = FALSE, option = "Simple", number_samples = 1, 
                 nworkers = NULL, rseed = NULL)

Arguments

in_data

The initial data frame that must be re-sampled. It must contain:

  1. an ID variable

  2. the variables of interest

  3. a grouping variable

grp_vector

The grouping variable of the data frame, defined under the name 'group' for example

grp_matrix

A matrix that contains

  1. the variable 'Group_ID' with entries all the available values of grouping variable

  2. the variable 'Resample_Size' with the sizes for each sample that will be created per grouping value

replace

A logical input: TRUE/FALSE if replacement should be used or not, respectively

option

A character input with next possible values

  1. "Simple", if we want to perform a simple re-sampling

  2. "Dirichlet", if we want to perform a Dirichlet weighted re-sampling

number_samples

The number of samples to be created. If it is greater than one, then parallel processing is used.

nworkers

The number of logical processors that will be used for parallel computing (usually it is the double of available physical cores)

rseed

The random seed that will be used for sampling. Useful for reproducible results

Value

It returns a list of mumber_samples data frames with exactly the same variables as the initial one, except that group variable has now only the given value from input data frame.

Author(s)

David Midgley

References

[1] D. N. Politis, J. P. Romano, M. Wolf, Subsampling (Springer-Verlag, New York, 1999).

[2] Baath R (2018). bayesboot: An Implementation of Rubin's (1981) Bayesian Bootstrap. R package version 0.2.2, URL https://CRAN.R-project.org/package=bayesboot

See Also

dirichlet_sample

Examples

## Load absolute temperature data set:
data("AbsoluteTemperature")
df <- AbsoluteTemperature
## Find portions for climate zones
pcs <- table(df$z)/dim(df)[1]
## Choose the approximate size of the new sample and compute resample sizes
N <- round(sqrt(nrow(AbsoluteTemperature)))
resamplesizes=as.integer(round(N*pcs))
sum(resamplesizes)
## Create the grouping matrix
groupmat <- data.frame("Group_ID"=1:4,"Resample_Size"=resamplesizes)
groupmat
## Simple resampling:
resample_simple <- grouped_resample(in_data = df, grp_vector = "z",
                                    grp_matrix = groupmat, replace = FALSE, option = "Simple",
                                    number_samples = 1, nworkers = NULL, rseed = 20191220)
cat(dim(resample_simple[[1]]),"\n")
## Dirichlet resampling:
resample_dirichlet <- grouped_resample(in_data = df, grp_vector = "z",
                                       grp_matrix = groupmat, replace = FALSE, option = "Dirichlet",
                                       number_samples = 1, nworkers = NULL, rseed = 20191220)
cat(dim(resample_dirichlet[[1]]),"\n")
##
# ## Work in parallel and create many samples
# ## Choose a random seed
# nseed <- 20191119
# ## Simple
# reslist1 <- grouped_resample(in_data = df, grp_vector = "z", grp_matrix = groupmat,
#                         replace = FALSE, option = "Simple",
#                         number_samples = 10, nworkers = NULL,
#                         rseed = nseed)
# sapply(reslist1, dim)
# ## Dirichlet
# reslist2 <- grouped_resample(in_data = df, grp_vector = "z", grp_matrix = groupmat,
#                         replace = FALSE, option = "Dirichlet",
#                         number_samples = 10, nworkers = NULL,
#                         rseed = nseed)
# sapply(reslist2, dim)
# ## Check for same rows between 1st sample of 'Simple' and 1st sample of 'Dirichlet' ...
# mapply(function(x,y){sum(rownames(x)%in%rownames(y))},reslist1,reslist2)
#

[Package archetypal version 1.3.0 Index]