chunk {batchtools}    R Documentation
Chunk Jobs for Sequential Execution
Description
Jobs can be partitioned into “chunks” to be executed sequentially on the computational nodes.
Chunks are defined by providing a data frame with columns “job.id” and “chunk” (integer)
to submitJobs.
All jobs with the same chunk number will be grouped together on one node to form a single
computational job.
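For example, a minimal sketch of such a table (hypothetical job ids; assumes an existing registry reg):

# jobs 1 and 2 run together in one chunk, job 3 in a second chunk
ch = data.frame(job.id = 1:3, chunk = c(1L, 1L, 2L))
submitJobs(ids = ch, reg = reg)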
The function chunk simply splits x either into a fixed number of groups, or
into a variable number of groups, each containing at most chunk.size elements.
The function lpt also groups x into a fixed number of chunks,
but uses the actual values of x in a greedy “Longest Processing Time” algorithm.
As a result, the maximum sum of elements per chunk is minimized.
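To illustrate the greedy rule, a minimal sketch (an illustration of the idea, not the package's internal code): visit the weights in decreasing order and assign each one to the chunk with the smallest running sum.

lpt_sketch = function(x, n.chunks = 1L) {
  sums = numeric(n.chunks)       # running sum per chunk
  chunks = integer(length(x))    # chunk assignment per element
  for (i in order(x, decreasing = TRUE)) {
    j = which.min(sums)          # least loaded chunk so far
    sums[j] = sums[j] + x[i]
    chunks[i] = j
  }
  chunks
}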
binpack splits x into a variable number of groups whose sums of elements do
not exceed the upper limit given by chunk.size.
See the examples of estimateRuntimes for an application of binpack and lpt.
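A comparable sketch of binpack using the classic first-fit decreasing heuristic (again an illustration; the package's actual algorithm may differ): place each weight, largest first, into the first chunk that still has room, opening a new chunk if none does.

binpack_sketch = function(x, chunk.size = max(x)) {
  sums = numeric(0)              # running sum per open chunk
  chunks = integer(length(x))
  for (i in order(x, decreasing = TRUE)) {
    j = which(sums + x[i] <= chunk.size)[1]
    if (is.na(j)) {              # no open chunk has room: open a new one
      sums = c(sums, 0)
      j = length(sums)
    }
    sums[j] = sums[j] + x[i]
    chunks[i] = j
  }
  chunks
}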
Usage
chunk(x, n.chunks = NULL, chunk.size = NULL, shuffle = TRUE)
lpt(x, n.chunks = 1L)
binpack(x, chunk.size = max(x))
Arguments
x
[numeric] For chunk, an atomic vector to split (usually the job ids). For lpt and binpack, the weights to group, e.g. estimated runtimes.
n.chunks
[integer(1)] Requested number of chunks. Mutually exclusive with chunk.size.
chunk.size
[integer(1)] Maximum number of elements per chunk (for chunk), or upper limit for the sum of elements per chunk (for binpack). Mutually exclusive with n.chunks.
shuffle
[logical(1)] Shuffle the elements of x before chunking? Default is TRUE.
Value
[integer] giving the chunk number for each element of x.
See Also
estimateRuntimes, submitJobs
Examples
ch = chunk(1:10, n.chunks = 2)
table(ch)
ch = chunk(rep(1, 10), chunk.size = 2)
table(ch)
set.seed(1)
x = runif(10)
ch = lpt(x, n.chunks = 2)
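# lpt evens out the sums: the two chunks end up roughly balanced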
sapply(split(x, ch), sum)
set.seed(1)
x = runif(10)
ch = binpack(x, 1)
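# no chunk's sum exceeds the chunk.size of 1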
sapply(split(x, ch), sum)
# Job chunking
tmp = makeRegistry(file.dir = NA, make.default = FALSE)
ids = batchMap(identity, 1:25, reg = tmp)
### Group into chunks with 10 jobs each
library(data.table)
ids[, chunk := chunk(job.id, chunk.size = 10)]
print(ids[, .N, by = chunk])
### Group into 4 chunks
ids[, chunk := chunk(job.id, n.chunks = 4)]
print(ids[, .N, by = chunk])
### Submit to batch system
submitJobs(ids = ids, reg = tmp)
# Grouped chunking
tmp = makeExperimentRegistry(file.dir = NA, make.default = FALSE)
prob = addProblem(reg = tmp, "prob1", data = iris, fun = function(job, data) nrow(data))
prob = addProblem(reg = tmp, "prob2", data = Titanic, fun = function(job, data) nrow(data))
algo = addAlgorithm(reg = tmp, "algo", fun = function(job, data, instance, i, ...) instance)
prob.designs = list(prob1 = data.table(), prob2 = data.table(x = 1:2))
algo.designs = list(algo = data.table(i = 1:3))
addExperiments(prob.designs, algo.designs, repls = 3, reg = tmp)
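# yields 27 experiments: (1 design for prob1 + 2 for prob2) x 3 algorithm settings x 3 replications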
### Group into chunks of 5 jobs, but do not put multiple problems into the same chunk
# -> only one problem has to be loaded per chunk, and only once because it is cached
ids = getJobTable(reg = tmp)[, .(job.id, problem, algorithm)]
ids[, chunk := chunk(job.id, chunk.size = 5), by = "problem"]
ids[, chunk := .GRP, by = c("problem", "chunk")]
dcast(ids, chunk ~ problem)
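# the cross tabulation confirms that each chunk holds jobs of one problem only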