clusterSetup {simFrame} | R Documentation |
Set up multiple samples on a cluster
Description
Generic function for setting up multiple samples on a cluster.
Usage
clusterSetup(cl, x, control, ...)
## S4 method for signature 'ANY,data.frame,SampleControl'
clusterSetup(cl, x, control)
Arguments
cl |
a cluster as generated by |
x |
the |
control |
a control object inheriting from the virtual class
|
... |
if |
Details
A fundamental design principle of the framework in the case of design-based simulation studies is that the sampling procedure is separated from the simulation procedure. Two main advantages arise from setting up all samples in advance.
First, setting up all samples in advance reduces overall computation time dramatically in certain situations, since computer-intensive tasks like stratification need to be performed only once. This is particularly relevant for large population data. In close-to-reality simulation studies carried out in research projects in survey statistics, often up to 10000 samples are drawn from a population of millions of individuals with stratified sampling designs. For such large data sets, stratification takes a considerable amount of time and is a very memory-intensive task. If the samples are taken on-the-fly, i.e., one sample is drawn in every simulation run, the function that takes the stratified sample would typically split the population into the different strata in each of the 10000 simulation runs. If all samples are drawn in advance, on the other hand, the population data need to be split only once and all 10000 samples can be taken from the respective strata together.
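The difference between splitting once and splitting in every run can be sketched in base R. This is only an illustration of the idea, not the code simFrame uses internally; the toy population `pop`, the stratum sizes `sizes`, and the number of samples `k` are hypothetical.

```r
# Hypothetical toy population: 10000 individuals in 4 strata (not eusilcP).
set.seed(1)
pop <- data.frame(id = seq_len(10000),
                  region = sample(c("A", "B", "C", "D"), 10000, replace = TRUE))
sizes <- c(A = 5, B = 5, C = 5, D = 5)  # sample size per stratum
k <- 100                                # number of samples to set up

# Split the row indices by stratum ONCE ...
strata <- split(seq_len(nrow(pop)), pop$region)

# ... then draw all k samples from the precomputed strata.
samples <- replicate(k, {
    unlist(Map(function(rows, n) rows[sample.int(length(rows), n)],
               strata, sizes[names(strata)]), use.names = FALSE)
}, simplify = FALSE)
```

Each element of `samples` is a vector of row indices into `pop`. An on-the-fly approach would repeat the `split()` call inside the loop, which is the cost that drawing all samples in advance avoids.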
Second, the samples can be stored permanently, which simplifies the reproduction of simulation results and may help to maximize comparability of results obtained by different partners in a research project. In particular, this is useful for large population data, when complex sampling techniques may be very time-consuming. In research projects involving different partners, usually different groups investigate different kinds of estimators. If these groups use not only the same population data, but also the same previously set up samples, their results are highly comparable.
The computational performance of setting up multiple samples can be increased by parallel computing. Since version 0.5.0, parallel computing in simFrame is implemented using the package parallel, which is part of the R base distribution since version 2.14.0 and builds upon work done for the contributed packages multicore and snow. Note that all objects and packages required for the computations (including simFrame) need to be made available on every worker process unless the worker processes are created by forking (see makeCluster).
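As a minimal sketch of making objects and packages available on the workers, the following uses only functions from parallel. The objects `popSize` and `drawOne` are hypothetical stand-ins, and `library(stats)` takes the place of `library(simFrame)` so that the sketch runs without simFrame installed.

```r
library(parallel)

# Hypothetical objects that the workers will need (not part of simFrame).
popSize <- 1000
drawOne <- function(size) sample.int(popSize, size)

# PSOCK workers are fresh R processes that start with empty workspaces.
cl <- makeCluster(2, type = "PSOCK")

# Copy the objects to every worker and attach required packages there.
clusterExport(cl, c("popSize", "drawOne"), envir = environment())
invisible(clusterEvalQ(cl, library(stats)))  # stand-in for library(simFrame)

# Each worker can now use the exported objects.
sizesOnWorkers <- unlist(clusterCall(cl, function() length(drawOne(5))))

stopCluster(cl)
```

With forked workers this step would not be needed, since forked processes inherit the workspace of the master process.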
In order to prevent problems with random numbers and to ensure reproducibility, random number streams should be used. With parallel, random number streams can be created via the function clusterSetRNGStream().
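The effect of random number streams can be checked with a small sketch that uses only parallel. The seed value and the use of `runif()` are arbitrary choices for illustration.

```r
library(parallel)

cl <- makeCluster(2, type = "PSOCK")

# Give every worker its own stream, derived from one seed.
clusterSetRNGStream(cl, iseed = 12345)
first <- clusterEvalQ(cl, runif(3))

# Resetting the streams with the same seed reproduces the same draws.
clusterSetRNGStream(cl, iseed = 12345)
second <- clusterEvalQ(cl, runif(3))

stopCluster(cl)

sameAcrossRuns <- identical(first, second)           # reproducible
workersDiffer <- !identical(first[[1]], first[[2]])  # streams are distinct
```

Both flags should be `TRUE`: the same seed gives the same results on every run, while the two workers draw from different streams.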
The control class "SampleControl" is highly flexible and allows stratified sampling as well as sampling of whole groups rather than individuals with a specified sampling method. Hence it is often sufficient to implement the desired sampling method for the simple non-stratified case to extend the existing framework. See "SampleControl" for some restrictions on the argument names of such a function, which should return a vector containing the indices of the sampled observations.
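As an illustration of such a function for the non-stratified case, here is a sketch of systematic sampling that returns a vector of indices. The function name `sysSample` is hypothetical, and the argument names `N` (population size) and `size` (sample size) are an assumption about the naming restrictions; check the "SampleControl" help page before relying on them.

```r
# Hypothetical sampling function: systematic sampling with a random start.
# Assumed argument names: N = population size, size = sample size.
sysSample <- function(N, size) {
    step <- N / size                  # spacing between selected units
    start <- runif(1, 0, step)        # random start within the first interval
    floor(start + step * (seq_len(size) - 1)) + 1  # indices in 1..N
}

set.seed(42)
idx <- sysSample(N = 100, size = 10)

# Assumption, not run here: such a function would be supplied through the
# 'fun' slot of the control object, along the lines of
#   clusterSetup(cl, eusilcP, fun = sysSample, size = 20, k = 4)
```

Because the function only handles the simple case, stratification and group sampling would be left to the framework, as described above.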
Nevertheless, for very complex sampling procedures, it is possible to define a control class "MySampleControl" extending "VirtualSampleControl", and the corresponding method clusterSetup(cl, x, control) with signature 'ANY, data.frame, MySampleControl'. In order to optimize computational performance, it is necessary to efficiently set up multiple samples. To this end, the slot k of "VirtualSampleControl" needs to be used to control the number of samples, and the resulting object must be of class "SampleSetup".
Value
An object of class "SampleSetup".
Methods
cl = "ANY", x = "data.frame", control = "character"
set up multiple samples on a cluster using a control class specified by the character string control. The slots of the control object may be supplied as additional arguments.
cl = "ANY", x = "data.frame", control = "missing"
set up multiple samples on a cluster using a control object of class "SampleControl". Its slots may be supplied as additional arguments.
cl = "ANY", x = "data.frame", control = "SampleControl"
set up multiple samples on a cluster as defined by the control object control.
Author(s)
Andreas Alfons
References
Alfons, A., Templ, M. and Filzmoser, P. (2010) An Object-Oriented Framework for Statistical Simulation: The R Package simFrame. Journal of Statistical Software, 37(3), 1–36. doi: 10.18637/jss.v037.i03.
L'Ecuyer, P., Simard, R., Chen, E. and Kelton, W. (2002) An Object-Oriented Random-Number Package with Many Long Streams and Substreams. Operations Research, 50(6), 1073–1075.
Rossini, A., Tierney, L. and Li, N. (2007) Simple Parallel Statistical Computing in R. Journal of Computational and Graphical Statistics, 16(2), 399–420.
Tierney, L., Rossini, A. and Li, N. (2009) snow: A Parallel Computing Framework for the R System. International Journal of Parallel Programming, 37(1), 78–90.
See Also
makeCluster, clusterSetRNGStream, setup, draw, "SampleControl", "TwoStageControl", "VirtualSampleControl", "SampleSetup"
Examples
## Not run:
# these examples require at least a dual core processor
# load data
data(eusilcP)
# start cluster
cl <- makeCluster(2, type = "PSOCK")
# load package and data on workers
clusterEvalQ(cl, {
    library(simFrame)
    data(eusilcP)
})
# set up random number stream
clusterSetRNGStream(cl, iseed = 12345)
# simple random sampling
srss <- clusterSetup(cl, eusilcP, size = 20, k = 4)
summary(srss)
draw(eusilcP[, c("id", "eqIncome")], srss, i = 1)
# group sampling
gss <- clusterSetup(cl, eusilcP, grouping = "hid", size = 10, k = 4)
summary(gss)
draw(eusilcP[, c("hid", "id", "eqIncome")], gss, i = 2)
# stratified simple random sampling
ssrss <- clusterSetup(cl, eusilcP, design = "region",
    size = c(2, 5, 5, 3, 4, 5, 3, 5, 2), k = 4)
summary(ssrss)
draw(eusilcP[, c("id", "region", "eqIncome")], ssrss, i = 3)
# stratified group sampling
sgss <- clusterSetup(cl, eusilcP, design = "region",
    grouping = "hid", size = c(2, 5, 5, 3, 4, 5, 3, 5, 2), k = 4)
summary(sgss)
draw(eusilcP[, c("hid", "id", "region", "eqIncome")], sgss, i = 4)
# stop cluster
stopCluster(cl)
## End(Not run)