ex_data_table_parallel {rqdatatable} | R Documentation
Execute an rquery pipeline with data.table in parallel.
Description
Execute an rquery pipeline with data.table in parallel, partitioned by a given column.
Note: usually the overhead of partitioning and distributing the work will by far overwhelm any parallel speedup.
Also, data.table itself already seems to exploit some thread-level parallelism (one often sees user time > elapsed time).
Requires the parallel package. For a worked example with significant speedup, please see https://github.com/WinVector/rqdatatable/blob/master/extras/Parallel_rqdatatable.md.
Usage
ex_data_table_parallel(
optree,
partition_column,
cl = NULL,
...,
tables = list(),
source_limit = NULL,
debug = FALSE,
env = parent.frame()
)
Arguments
optree | relop operations tree.
partition_column | character, name of the column to partition work by.
cl | a cluster object, created by package parallel or by package snow. If NULL, use the registered default cluster.
... | not used; forces later arguments to bind by name.
tables | named list mapping the table names used in nodes to data.tables and data.frames.
source_limit | if not NULL, limit all table sources to no more than this many rows (used for debugging).
debug | logical; if TRUE, use lapply instead of parallel::clusterApplyLB.
env | environment to look for values in.
Details
Care must be taken that the calculation partitioning is coarse enough to ensure a correct calculation. For example: anything one is joining on, aggregating over, or ranking over must be grouped so that all elements affecting a given result row are in the same level of the partition.
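The partitioning requirement can be illustrated without any parallel machinery. The following is a minimal base R sketch, not the package's implementation: the helper sum_by_g_in_pieces and the toy columns g, h, and x are hypothetical names introduced only for this illustration. It splits the data on a column, aggregates each piece separately, and stacks the pieces, which is roughly the shape of a partitioned calculation.

```r
# Toy data: the goal is the per-`g` total of `x`.
d <- data.frame(
  g = c("a", "a", "b", "b"),
  h = c(1, 2, 1, 2),  # finer than g: every g group is spread over both h levels
  x = c(1, 2, 3, 4)
)

# Hypothetical helper: split on `col`, aggregate each piece on its own,
# then stack the per-piece results.
sum_by_g_in_pieces <- function(data, col) {
  pieces <- split(data, data[[col]])
  parts <- lapply(pieces, function(p) aggregate(x ~ g, data = p, FUN = sum))
  res <- do.call(rbind, parts)
  rownames(res) <- NULL
  res[order(res$g, res$x), , drop = FALSE]
}

# Partitioning on g keeps each group whole: one correct row per group.
good <- sum_by_g_in_pieces(d, "g")

# Partitioning on h splits each group across pieces: partial sums only.
bad <- sum_by_g_in_pieces(d, "h")

print(good)
print(bad)
```

Partitioning on g yields one row per group, while partitioning on h yields two partial rows per group, because no single piece sees every row that contributes to a group total.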
Value
The resulting data.table (intermediate tables can sometimes be mutated, as is common practice with data.table).
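Examples
A minimal usage sketch, assuming the rqdatatable, rquery, wrapr, and parallel packages are installed. The toy data frame d, its columns g and x, and the new column total are made up for illustration. Attaching rqdatatable on each worker with clusterEvalQ is an assumption about what the workers need, and the table is passed explicitly through tables rather than relying on lookup in env.

```r
library(wrapr)       # provides the %.>% pipe
library(rquery)      # provides local_td() and extend()
library(rqdatatable)

# Toy data; `g` is both the grouping key and the partition column,
# so every row that affects a group total stays in one partition.
d <- data.frame(
  g = c("a", "a", "b", "b"),
  x = c(1, 2, 3, 4)
)

# Pipeline: add the per-group total of x to every row.
optree <- local_td(d) %.>%
  extend(., total = sum(x), partitionby = "g")

cl <- parallel::makeCluster(2)
# Assumption: each worker needs rqdatatable attached to run its share.
invisible(parallel::clusterEvalQ(cl, library("rqdatatable")))

res <- tryCatch(
  ex_data_table_parallel(optree, "g", cl, tables = list(d = d)),
  finally = parallel::stopCluster(cl)  # always release the workers
)

print(res[order(res$g, res$x), ])
```

On data this small the cluster overhead dominates, as the Description warns, so the sketch shows only the calling pattern. Setting debug = TRUE runs the same partitions with lapply, which can help when checking a pipeline before distributing it.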