runArraySimulation {SimDesign} | R Documentation |
Run a Monte Carlo simulation using array job submissions per condition
Description
This function has the same purpose as runSimulation; however, rather than
evaluating every row in a design object within one session (potentially with
a parallel computing architecture), it evaluates the simulation one
independent row condition at a time. This is mainly useful when distributing
jobs to HPC clusters where a job array number is available (e.g., via SLURM)
and the simulation results must be saved to independent files as they
complete. expandDesign is useful for distributing replications across
different jobs, while gen_seeds is required to ensure high-quality random
number generation across the array submissions. See the associated vignette
for a brief tutorial of this setup.
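As a rough illustration of the "one row per array job" idea, the base-R sketch below reads the task index that SLURM exports as SLURM_ARRAY_TASK_ID and uses it to pick a single design row. The helper read_array_id() is hypothetical and is not how getArrayID is necessarily implemented; the environment variable is set by hand here only to emulate a scheduler on a local machine.

```r
# Hypothetical helper (not part of SimDesign): read the array index that the
# scheduler exports to each job in the array.
read_array_id <- function(var = "SLURM_ARRAY_TASK_ID") {
  id <- suppressWarnings(as.integer(Sys.getenv(var, unset = NA)))
  if (is.na(id)) stop("environment variable '", var, "' is not set")
  id
}

# Local emulation only: on a real cluster SLURM sets this variable itself
Sys.setenv(SLURM_ARRAY_TASK_ID = "2")

design <- data.frame(N = c(10, 20, 30))   # stand-in for a design object
id <- read_array_id()
condition <- design[id, , drop = FALSE]   # the single row this job evaluates
condition
```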
Usage
runArraySimulation(
design,
...,
replications,
iseed,
filename,
dirname = NULL,
arrayID = getArrayID(),
filename_suffix = paste0("-", arrayID),
addArrayInfo = TRUE,
save_details = list(),
control = list()
)
Arguments
design |
design object containing simulation conditions on a per row basis.
This function is designed to submit each row as an independent job on an HPC cluster.
See |
... |
additional arguments to be passed to |
replications |
number of independent replications to perform per
condition (i.e., each row in |
iseed |
initial seed to be passed to |
filename |
file name to save simulation files to (does not need to
specify extension). However, the array ID will be appended to each
|
dirname |
directory to save the files associated with |
arrayID |
array identifier from the scheduler. Must be a number between
1 and |
filename_suffix |
suffix to add to the |
addArrayInfo |
logical; should the array ID and original design row number
be added to the |
save_details |
optional list of extra file saving details.
See |
control |
control list passed to
Similarly, |
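To make the interaction between filename, dirname, and filename_suffix concrete, here is a minimal base-R sketch of how an output path could be assembled from the defaults shown in Usage. The helper array_file() is hypothetical (it is not a SimDesign function), and the ".rds" extension is an assumption taken from the file names that appear in the Examples ('mysim-1.rds', "sim/condition-#.rds").

```r
# Hypothetical helper: mimic the default naming scheme, where the suffix is
# paste0("-", arrayID) and results are stored as .rds files.
array_file <- function(filename, arrayID, dirname = NULL,
                       filename_suffix = paste0("-", arrayID)) {
  file <- paste0(filename, filename_suffix, ".rds")
  if (!is.null(dirname)) file <- file.path(dirname, file)
  file
}

array_file("mysim", 1L)                        # "mysim-1.rds"
array_file("condition", 14L, dirname = "sim")  # "sim/condition-14.rds"
```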
Details
Due to the nature of how the replications are split, it is important that
the L'Ecuyer-CMRG (2002) method of random seeds is used across all
array ID submissions (cf. runSimulation's parallel approach, which uses this
method to distribute random seeds within each isolated condition rather than
between all conditions). As such, this function requires the seeds to be
generated using gen_seeds with the iseed and arrayID inputs to ensure that
each job is analyzing a high-quality set of random numbers via
L'Ecuyer-CMRG's (2002) method.
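The idea of giving each array job its own L'Ecuyer-CMRG stream can be sketched with base R and the parallel package alone. This is an illustration of the general technique only, not the internals of gen_seeds; the helper stream_for_job() is hypothetical, and stepping through streams with nextRNGStream() is an assumption made for the sketch.

```r
library(parallel)  # ships with R; provides nextRNGStream()

# Hypothetical helper, not a SimDesign function: return the L'Ecuyer-CMRG
# stream for a given array ID by stepping forward from one shared seed.
stream_for_job <- function(iseed, arrayID) {
  RNGkind("L'Ecuyer-CMRG")   # note: changes the session's RNG kind
  set.seed(iseed)
  stream <- get(".Random.seed", envir = globalenv())
  for (i in seq_len(arrayID - 1L)) stream <- nextRNGStream(stream)
  stream
}

s1 <- stream_for_job(554184288, 1L)
s2 <- stream_for_job(554184288, 2L)

# Install the stream for job 2 before generating that job's data
assign(".Random.seed", s2, envir = globalenv())
x <- rnorm(3)
```

Because every job starts from the same initial seed and differs only in its array ID, rerunning a job reproduces its stream, while different jobs draw from separate streams.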
Additionally, for timed simulations on HPC clusters it is also recommended to
pass a control = list(max_time) value to avoid discarding conditions that
require more than the time specified in the shell script. The max_time value
should be less than the maximum time allocated on the HPC cluster (e.g.,
approximately 90% of the allocated time, though the appropriate margin
depends on how long each replication takes). For simulations with missing
replication information, submit a new set of jobs at a later time to collect
the missing replications.
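As a small worked example of the time margin described above, the following base-R sketch converts a scheduler allocation into a shorter "HH:MM:SS" string of the kind used for max_time in the Examples. The helper budget_max_time() and its fraction argument are hypothetical, and the 0.9 default is only the rule of thumb mentioned above, not a value enforced by the package.

```r
# Hypothetical helper: scale an "HH:MM:SS" allocation by a safety fraction
budget_max_time <- function(allocated, fraction = 0.9) {
  parts <- as.numeric(strsplit(allocated, ":", fixed = TRUE)[[1]])
  secs <- floor(sum(parts * c(3600, 60, 1)) * fraction)
  sprintf("%02d:%02d:%02d",
          as.integer(secs %/% 3600),
          as.integer((secs %% 3600) %/% 60),
          as.integer(secs %% 60))
}

max_time <- budget_max_time("05:00:00")  # 90% of a five hour allocation
max_time                                 # "04:30:00"
control <- list(max_time = max_time)     # could then be passed as 'control'
```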
Author(s)
Phil Chalmers rphilip.chalmers@gmail.com
References
Chalmers, R. P., & Adkins, M. C. (2020). Writing Effective and Reliable Monte Carlo Simulations
with the SimDesign Package. The Quantitative Methods for Psychology, 16(4), 248-280.
doi:10.20982/tqmp.16.4.p248
Sigal, M. J., & Chalmers, R. P. (2016). Play it again: Teaching statistics with Monte
Carlo simulation. Journal of Statistics Education, 24(3), 136-156.
doi:10.1080/10691898.2016.1246953
See Also
runSimulation, expandDesign, gen_seeds, SimCollect, getArrayID
Examples
library(SimDesign)
Design <- createDesign(N = c(10, 20, 30))
Generate <- function(condition, fixed_objects) {
    dat <- with(condition, rnorm(N, 10, 5)) # distributed N(10, 5)
    dat
}
Analyse <- function(condition, dat, fixed_objects) {
    ret <- c(mean=mean(dat), median=median(dat)) # mean/median of sample data
    ret
}
Summarise <- function(condition, results, fixed_objects){
    colMeans(results)
}
## Not run:
# define initial seed (do this only once to keep it constant!)
# iseed <- genSeeds()
iseed <- 554184288
### On cluster submission, the active array ID is obtained via getArrayID(),
### and therefore should be used in real SLURM submissions
arrayID <- getArrayID(type = 'slurm')
# However, for the following example array ID is set to first row only
arrayID <- 1L
# run the simulation (results not caught on job submission, only files saved)
res <- runArraySimulation(design=Design, replications=50,
                          generate=Generate, analyse=Analyse,
                          summarise=Summarise, arrayID=arrayID,
                          iseed=iseed, filename='mysim') # saved as 'mysim-1.rds'
res
SimResults(res) # condition and replication count stored
dir()
SimClean('mysim-1.rds')
########################
# Same submission job as above, however split the replications over multiple
# evaluations and combine when complete
Design5 <- expandDesign(Design, 5)
Design5
# iseed <- genSeeds()
iseed <- 554184288
# arrayID <- getArrayID(type = 'slurm')
arrayID <- 14L
# run the simulation (replications reduced per row, but same in total)
runArraySimulation(design=Design5, replications=10,
                   generate=Generate, analyse=Analyse,
                   summarise=Summarise, iseed=iseed,
                   filename='mylongsim', arrayID=arrayID)
res <- readRDS('mylongsim-14.rds')
res
SimResults(res) # condition and replication count stored
SimClean('mylongsim-14.rds')
###
# Emulate the arrayID distribution, storing all results in a 'sim/' folder
dir.create('sim/')
# Emulate distribution to nrow(Design5) = 15 independent job arrays
## (just used for presentation purposes on local computer)
sapply(1:nrow(Design5), \(arrayID)
    runArraySimulation(design=Design5, replications=10,
                       generate=Generate, analyse=Analyse,
                       summarise=Summarise, iseed=iseed, arrayID=arrayID,
                       filename='condition', dirname='sim', # files: "sim/condition-#.rds"
                       control = list(max_time="04:00:00", max_RAM="4GB"))) |> invisible()
# If necessary, conditions above will manually terminate before
# 4 hours and 4GB of RAM are used, returning any
# successfully completed results before the HPC session times
# out (provided .slurm script specified more than 4 hours)
# list saved files
dir('sim/')
setwd('sim')
condition14 <- readRDS('condition-14.rds')
condition14
SimResults(condition14)
# aggregate simulation results into single file
final <- SimCollect(files=dir())
final
SimResults(final) |> View()
setwd('..')
SimClean(dirs='sim/')
## End(Not run)
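Before aggregating results, it can help to check which array jobs never wrote an output file, so that only those are resubmitted. The base-R sketch below emulates this with placeholder files in a temporary folder; find_missing_jobs() is a hypothetical helper, and the "condition-#.rds" naming is assumed from the example above. It only detects absent files, not files that contain fewer replications than requested.

```r
# Hypothetical helper: which array IDs have no saved results file?
find_missing_jobs <- function(dir, prefix, n_jobs) {
  expected <- paste0(prefix, "-", seq_len(n_jobs), ".rds")
  which(!file.exists(file.path(dir, expected)))
}

# Local emulation: pretend jobs 1, 2, and 4 finished out of 5 submitted
tmp <- tempfile("sim_check")
dir.create(tmp)
for (id in c(1, 2, 4))
  saveRDS(list(id = id), file.path(tmp, paste0("condition-", id, ".rds")))

missing_ids <- find_missing_jobs(tmp, "condition", n_jobs = 5)
missing_ids   # 3 5

unlink(tmp, recursive = TRUE)  # clean up the placeholder files
```

The returned IDs could then be used for a follow-up submission (for example, SLURM accepts an explicit list such as --array=3,5).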