R: Set up multiple samples on a cluster

clusterSetup {simFrame}

R Documentation

Set up multiple samples on a cluster

Description

Generic function for setting up multiple samples on a cluster.

Usage

clusterSetup(cl, x, control, ...)

## S4 method for signature 'ANY,data.frame,SampleControl'
clusterSetup(cl, x, control)

Arguments

`cl`	a cluster as generated by `makeCluster`.
`x`	the `data.frame` to sample from.
`control`	a control object inheriting from the virtual class `"VirtualSampleControl"` or a character string specifying such a control class (the default being `"SampleControl"`).
`...`	if `control` is a character string or missing, the slots of the control object may be supplied as additional arguments. See `"SampleControl"` for details on the slots.

Details

A fundamental design principle of the framework in the case of design-based simulation studies is that the sampling procedure is separated from the simulation procedure. Two main advantages arise from setting up all samples in advance.

First, drawing all samples together can reduce overall computation time dramatically, since computer-intensive tasks like stratification need to be performed only once. This is particularly relevant for large population data. In close-to-reality simulation studies carried out in research projects in survey statistics, often up to 10000 samples are drawn from a population of millions of individuals with stratified sampling designs. For such large data sets, stratification takes a considerable amount of time and is very memory-intensive. If the samples are taken on the fly, i.e., one sample is drawn in every simulation run, the function that takes the stratified sample would typically split the population into the different strata in each of the 10000 simulation runs. If all samples are drawn in advance, on the other hand, the population data need to be split only once and all 10000 samples can be taken from the respective strata together.
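The effect of splitting only once can be illustrated with a minimal base R sketch. It does not use simFrame: the toy population `pop`, the helper `drawStratified()` and the per-stratum sample sizes are hypothetical and serve only to show the idea of reusing one split for all samples.

```r
# Hypothetical toy population with three strata (not simFrame data)
set.seed(1)
pop <- data.frame(id = 1:1000,
                  region = sample(c("A", "B", "C"), 1000, replace = TRUE))

# Split the row indices by stratum ONCE
strata <- split(seq_len(nrow(pop)), pop$region)
size <- c(A = 5, B = 3, C = 2)   # sample size per stratum
k <- 4                           # number of samples to set up

# Draw one stratified sample from the precomputed strata
drawStratified <- function(strata, size) {
    unlist(mapply(function(idx, n) idx[sample.int(length(idx), n)],
                  strata, size[names(strata)], SIMPLIFY = FALSE),
           use.names = FALSE)
}

# All k samples reuse the same split
samples <- replicate(k, drawStratified(strata, size), simplify = FALSE)
```

Each element of `samples` is a vector of row indices into `pop`. Drawing on the fly would instead repeat the `split()` call inside every simulation run.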

Second, the samples can be stored permanently, which simplifies the reproduction of simulation results and may help to maximize comparability of results obtained by different partners in a research project. This is particularly useful for large population data, for which complex sampling techniques may be very time-consuming. In research projects involving different partners, usually different groups investigate different kinds of estimators. If these groups use not only the same population data, but also the same previously set up samples, their results are highly comparable.
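One possible way to store samples permanently is the base R pair `saveRDS()` and `readRDS()`. The sketch below uses a plain list of index vectors as a hypothetical stand-in for a set of samples; whether this is the most convenient storage for a particular project is a judgment call.

```r
# Hypothetical stand-in for a set of previously drawn samples
set.seed(1)
samples <- replicate(4, sample.int(1000, 20), simplify = FALSE)

# Store the samples once ...
file <- tempfile(fileext = ".rds")
saveRDS(samples, file)

# ... and reload them later, or on a partner's machine
restored <- readRDS(file)
same <- identical(samples, restored)
```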

The computational performance of setting up multiple samples can be increased by parallel computing. Since version 0.5.0, parallel computing in simFrame has been implemented using the package parallel, which has been part of the R base distribution since version 2.14.0 and builds upon work done for the contributed packages multicore and snow. Note that all objects and packages required for the computations (including simFrame) need to be made available on every worker process unless the worker processes are created by forking (see makeCluster).
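As a minimal sketch of making objects and packages available on the workers, the following uses `clusterExport()` and `clusterEvalQ()` from parallel. The object `popSize` is hypothetical, and the base package stats merely stands in for the packages actually needed (such as simFrame).

```r
library(parallel)

# Start two worker processes (no forking, so nothing is shared)
cl <- makeCluster(2, type = "PSOCK")

# A hypothetical object that the workers need
popSize <- 1000L

# Copy the object from the current environment to every worker
clusterExport(cl, "popSize", envir = environment())

# Load required packages on every worker (stats is only a placeholder)
invisible(clusterEvalQ(cl, library(stats)))

# Each worker can now see popSize
res <- clusterEvalQ(cl, popSize)

stopCluster(cl)
```

Here `res` is a list with one element per worker, each holding the exported value.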

In order to prevent problems with random numbers and to ensure reproducibility, random number streams should be used. With parallel, random number streams can be created via the function clusterSetRNGStream().
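The following sketch illustrates why random number streams help with reproducibility: resetting the streams with the same seed makes the workers reproduce their draws. The seed value and the use of `sample.int()` are arbitrary choices for illustration.

```r
library(parallel)

cl <- makeCluster(2, type = "PSOCK")

# Initialize the random number streams on the workers and draw
clusterSetRNGStream(cl, iseed = 12345)
first <- clusterEvalQ(cl, sample.int(100, 5))

# Reset the streams with the same seed and draw again
clusterSetRNGStream(cl, iseed = 12345)
second <- clusterEvalQ(cl, sample.int(100, 5))

stopCluster(cl)

# The two runs agree worker by worker
reproducible <- identical(first, second)
```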

The control class "SampleControl" is highly flexible and allows stratified sampling as well as sampling of whole groups rather than individuals with a specified sampling method. Hence it is often sufficient to implement the desired sampling method for the simple non-stratified case to extend the existing framework. See "SampleControl" for some restrictions on the argument names of such a function, which should return a vector containing the indices of the sampled observations.
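As a sketch of such a function for the simple non-stratified case, the following implements systematic sampling in base R. The function name `sysSample` is hypothetical, and the argument names `N` (population size) and `size` (sample size) are an assumption about the naming restrictions; see "SampleControl" for the actual requirements. The function returns the indices of the sampled observations.

```r
# Hypothetical systematic sampling function (not part of simFrame).
# Assumed argument names: N = population size, size = sample size.
sysSample <- function(N, size) {
    step <- N / size                         # sampling interval
    start <- runif(1, min = 0, max = step)   # random start in (0, step)
    # indices of the sampled observations, each in 1..N
    floor(start + step * (seq_len(size) - 1)) + 1
}

set.seed(42)
idx <- sysSample(N = 100, size = 10)
```

Assuming the naming restrictions are met, such a function would then be supplied to the control object rather than called directly.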

Nevertheless, for very complex sampling procedures, it is possible to define a control class "MySampleControl" extending "VirtualSampleControl", and the corresponding method clusterSetup(cl, x, control) with signature 'ANY, data.frame, MySampleControl'. To obtain good computational performance, such a method should set up all samples efficiently in a single call. The slot k of "VirtualSampleControl" must be used to control the number of samples, and the resulting object must be of class "SampleSetup".

Value

An object of class "SampleSetup".

Methods

cl = "ANY", x = "data.frame", control = "character": set up multiple samples on a cluster using a control class specified by the character string control. The slots of the control object may be supplied as additional arguments.
cl = "ANY", x = "data.frame", control = "missing": set up multiple samples on a cluster using a control object of class "SampleControl". Its slots may be supplied as additional arguments.
cl = "ANY", x = "data.frame", control = "SampleControl": set up multiple samples on a cluster as defined by the control object control.

Author(s)

Andreas Alfons

References

Alfons, A., Templ, M. and Filzmoser, P. (2010) An Object-Oriented Framework for Statistical Simulation: The R Package simFrame. Journal of Statistical Software, 37(3), 1–36. doi: 10.18637/jss.v037.i03.

L'Ecuyer, P., Simard, R., Chen, E. and Kelton, W. (2002) An Object-Oriented Random-Number Package with Many Long Streams and Substreams. Operations Research, 50(6), 1073–1075.

Rossini, A., Tierney, L. and Li, N. (2007) Simple Parallel Statistical Computing in R. Journal of Computational and Graphical Statistics, 16(2), 399–420.

Tierney, L., Rossini, A. and Li, N. (2009) snow: A Parallel Computing Framework for the R System. International Journal of Parallel Programming, 37(1), 78–90.

Examples

## Not run: 
# these examples require at least a dual core processor

# load data
data(eusilcP)

# start cluster
cl <- makeCluster(2, type = "PSOCK")

# load package and data on workers
clusterEvalQ(cl, {
    library(simFrame)
    data(eusilcP)
})

# set up random number stream
clusterSetRNGStream(cl, iseed = 12345)

# simple random sampling
srss <- clusterSetup(cl, eusilcP, size = 20, k = 4)
summary(srss)
draw(eusilcP[, c("id", "eqIncome")], srss, i = 1)

# group sampling
gss <- clusterSetup(cl, eusilcP, grouping = "hid", size = 10, k = 4)
summary(gss)
draw(eusilcP[, c("hid", "id", "eqIncome")], gss, i = 2)

# stratified simple random sampling
ssrss <- clusterSetup(cl, eusilcP, design = "region",
    size = c(2, 5, 5, 3, 4, 5, 3, 5, 2), k = 4)
summary(ssrss)
draw(eusilcP[, c("id", "region", "eqIncome")], ssrss, i = 3)

# stratified group sampling
sgss <- clusterSetup(cl, eusilcP, design = "region",
    grouping = "hid", size = c(2, 5, 5, 3, 4, 5, 3, 5, 2), k = 4)
summary(sgss)
draw(eusilcP[, c("hid", "id", "region", "eqIncome")], sgss, i = 4)

# stop cluster
stopCluster(cl)

## End(Not run)

[Package simFrame version 0.5.4 Index]