ex_data_table_parallel {rqdatatable}

R Documentation

Execute an `rquery` pipeline with `data.table` in parallel.

Description

Execute an rquery pipeline with data.table in parallel, partitioned by a given column. Note: the overhead of partitioning and distributing the work usually far outweighs any parallel speedup. In addition, data.table itself already appears to exploit some thread-level parallelism (one often sees user time > elapsed time). Requires the parallel package. For a worked example with significant speedup, please see https://github.com/WinVector/rqdatatable/blob/master/extras/Parallel_rqdatatable.md.

Usage

ex_data_table_parallel(
  optree,
  partition_column,
  cl = NULL,
  ...,
  tables = list(),
  source_limit = NULL,
  debug = FALSE,
  env = parent.frame()
)

Arguments

`optree`	relop operations tree.
`partition_column`	character name of column to partition work by.
`cl`	a cluster object, created by package parallel or by package snow. If NULL, use the registered default cluster.
`...`	not used, force later arguments to bind by name.
`tables`	named list map from table names used in nodes to data.tables and data.frames.
`source_limit`	if not NULL, limit all table sources to no more than this many rows (used for debugging).
`debug`	logical; if TRUE, use lapply instead of parallel::clusterApplyLB.
`env`	environment to look for values in.

Details

Care must be taken that the partitioning of the calculation is coarse enough to ensure a correct result. For example: anything one is joining on, aggregating over, or ranking over must be grouped so that all rows affecting a given result row are in the same level of the partition.
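
The partitioning requirement can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the package examples: the data frame `d`, its columns `g` and `x`, the derived partition column `part`, and the two-worker local cluster. The pipeline computes a per-group sum over `g`, so the partition key is derived from `g` to keep all rows of a group in the same partition.

```r
library("rquery")
library("rqdatatable")

# Hypothetical example data: values x in groups g.
d <- data.frame(
  g = c("a", "a", "b", "b", "c"),
  x = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Derive a coarse partition key from g, so that all rows sharing a g value
# fall in the same partition (the column name "part" is made up here).
d$part <- as.integer(factor(d$g)) %% 2L

# Pipeline: windowed per-group sum of x over g.
optree <- extend(local_td(d), total := sum(x), partitionby = "g")

# Two local workers; each worker needs rqdatatable loaded.
cl <- parallel::makeCluster(2)
parallel::clusterEvalQ(cl, library("rqdatatable"))

# Run the pipeline once per level of "part" and combine the results.
res <- ex_data_table_parallel(optree, "part", cl, tables = list(d = d))

parallel::stopCluster(cl)
print(res)
```

Partitioning on a column unrelated to `g` could split a group across workers and give wrong per-group totals. Passing `debug = TRUE` runs the same partitions through lapply, which can help when checking a partitioning scheme without a cluster.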

Value

The resulting data.table. Intermediate tables may sometimes be mutated in place, as is common practice with data.table.