chunk.apply {iotools} | R Documentation |
Process input by applying a function to each chunk
Description
chunk.apply
processes input in chunks and applies FUN
to each chunk, collecting the results.
Usage
chunk.apply(input, FUN, ..., CH.MERGE = rbind, CH.MAX.SIZE = 33554432,
CH.PARALLEL=1L, CH.SEQUENTIAL=TRUE, CH.BINARY=FALSE,
CH.INITIAL=NULL)
chunk.tapply(input, FUN, ..., sep = "\t", CH.MERGE = rbind, CH.MAX.SIZE = 33554432)
Arguments
input |
Either a chunk reader or a file name or connection that will be used to create a chunk reader |
FUN |
Function to apply to each chunk |
... |
Additional parameters passed to |
sep |
for |
CH.MERGE |
Function to call to merge results from all
chunks. Common values are |
CH.MAX.SIZE |
maximal size of each chunk in bytes |
CH.PARALLEL |
the number of parallel processes to use in the calculation (unix only). |
CH.SEQUENTIAL |
logical, only relevant for parallel
processing. If |
CH.BINARY |
logical, if |
CH.INITIAL |
Function which will be applied to the first chunk if
|
Details
Due to the fact that chunk-wise processing is typically used when the
input data is too large to fit in memory, there are additional
considerations depending on whether the results after applying
FUN
are itself large or not. If they are not, then the apporach
of accumulating them and then applying CH.MERGE
on all results
at once is typically the most efficient and it is what
CH.BINARY=FALSE
will do.
However, in some situations where the result are resonably big or
the number of chunks is very high, it may be more efficient to update
a sort of state based on each arriving chunk instead of collecting all
results. This can be achieved by setting CH.BINARY=TRUE
in which
case the process is equivalent to:
res <- CH.INITIAL(FUN(chunk1)) res <- CH.MERGE(res, FUN(chunk2)) res <- CH.MERGE(res, FUN(chunk3)) ... res
If CH.INITITAL
is NULL
then the first line is
res <- CH.MERGE(NULL, FUN(chunk1))
.
The parameter CH.SEQUENTIAL
is only used if parallel
processing is requested. It allows the system to process chunks out of
order for performace reasons. If it is TRUE
then the order of
the chunks is respected, but merging can only proceed if the result of
the next chunk is avaiable. With CH.SEQUENTIAL=FALSE
the workers
will continue processing further chunks as they become avaiable, not
waiting for the results of the preceding calls. This is more
efficient, but the order of the chunks in the result is not
deterministic.
Note that if parallel processing is required then all calls to
FUN
should be considered independent. However, CH.MERGE
is always run in the current session and thus is allowed to have
side-effects.
Value
The result of calling CH.MERGE
on all chunk results as
arguments (CH.BINARY=FALSE
) or result of the last call to
binary CH.MERGE
.
Note
The input to FUN
is the raw chunk, so typically it is
advisable to use mstrsplit
or similar function as the
first step in FUN
.
Note
The support for CH.PARALLEL
is considered experimental and may
change in the future.
Author(s)
Simon Urbanek
Examples
## Not run:
## compute quantiles of the first variable for each chunk
## of at most 10kB size
chunk.apply("input.file.txt",
function(o) {
m = mstrsplit(o, type='numeric')
quantile(m[,1], c(0.25, 0.5, 0.75))
}, CH.MAX.SIZE=1e5)
## End(Not run)