Start {startR} | R Documentation |
Declare, discover, subset and retrieve multidimensional distributed data sets
Description
See the startR documentation and
tutorial for a step-by-step explanation on how to use Start().
Nowadays in the era of big data, large multidimensional data sets from
diverse sources need to be combined and processed. Analysis of big data in any
field is often highly complex and time-consuming. Taking subsets of these data
sets and processing them efficiently has become an indispensable practice. This
technique is also known as Domain Decomposition, Map Reduce or, more commonly,
'chunking'.
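The chunking idea can be illustrated with a minimal base R sketch that does not use startR at all. The array, its dimension order (time, latitude, longitude) and the number of chunks are assumptions made up for illustration.

```r
# Hypothetical data: a 3-D array ordered as (time, latitude, longitude).
set.seed(1)
x <- array(rnorm(4 * 6 * 8), dim = c(4, 6, 8))

# Split the longitude indices (third dimension) into 2 contiguous chunks.
n_chunks <- 2
lon_idx <- seq_len(dim(x)[3])
chunk_ids <- split(lon_idx, cut(lon_idx, n_chunks, labels = FALSE))

# Map: compute the time mean of each chunk independently.
partial <- lapply(chunk_ids, function(idx) {
  apply(x[, , idx, drop = FALSE], c(2, 3), mean)
})

# Reduce: bind the partial results back together along longitude.
chunked <- do.call(cbind, partial)

# Reference: the same computation on the whole array at once.
whole <- apply(x, c(2, 3), mean)
```

Both routes give the same latitude-by-longitude matrix; the chunked route only ever needs one chunk in memory at a time.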
startR (Subset, TrAnsform, ReTrieve, arrange and process large
multidimensional data sets in R) is an R project started at BSC with the aim
to develop a tool that allows the user to automatically process large
multidimensional distributed data sets. It is an open source project that is
open to external collaboration and funding, and will continuously evolve to
support as many data set formats as possible while maximizing its efficiency.
startR provides a framework under which a data set (a collection of one
or multiple data files, potentially distributed over various remote servers)
is perceived as if it were a single large multidimensional
array. Once such a multidimensional array is declared, any user-defined function
can be applied to the data in an apply-like fashion, where startR
transparently implements the Map Reduce paradigm. The steps to follow in order
to process a collection of big data sets are as follows:
- Declaring the data set, i.e. declaring the distribution of the data files involved, the dimensions and shape of the multidimensional array, and the boundaries of the target data. This step can be performed with the Start() function. Numeric indices or coordinate values can be used when fixing the boundaries. The data often needs transformations, pre-processing or reordering; Start() accepts user-defined transformation or reordering functions for such purposes. Once a data set is declared, a list of involved files, dimension lengths, memory size and other metadata is made available. Optionally, the data set can be retrieved and loaded onto the current R session if it is small enough.
- Declaring the workflow of operations to perform on the involved data set(s). This step can be performed with the Step() and AddStep() functions.
- Defining the computation settings. The mandatory settings include a) how many subsets to divide the data sets into and along which dimensions; b) which platform to perform the workflow of operations on (local machine or remote machine/HPC?), how to communicate with it (unidirectional or bidirectional connection? shared or separate file systems?), which queuing system it uses (Slurm, PBS, LSF, none?); and c) how many parallel jobs and execution threads per job to use when running the calculations. This step can be performed when building up the call to the Compute() function.
- Running the computation. startR transparently implements the Map Reduce paradigm, according to the settings in the previous steps. The progress can optionally be monitored with the EC-Flow workflow management tool. When the computation ends, a report of performance timings is displayed. This step can be triggered with the Compute() function.
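The four steps above can be sketched as follows. This is a hedged outline, not a definitive recipe: the Start() call is copied from the Examples section of this page, while the argument names used for Step(), AddStep() and Compute() (fun, target_dims, output_dims, inputs, step_fun, workflow, chunks) are not described on this page and are assumed from those functions' own help pages, so check them against the installed version. The step function time_mean is a hypothetical example.

```r
# Step function: the operation applied to each piece of data.
# It receives the values along the 'time' dimension and returns one number.
time_mean <- function(x) {
  mean(x, na.rm = TRUE)
}

# The rest only runs where startR and its bundled sample data are installed.
if (requireNamespace("startR", quietly = TRUE)) {
  library(startR)

  # 1. Declare the data set (same call as in the Examples section).
  data_path <- system.file('extdata', package = 'startR')
  path_obs <- file.path(data_path, 'obs/monthly_mean/$var$/$var$_$sdate$.nc')
  data <- Start(dat = list(list(path = path_obs)),
                var = 'tos',
                sdate = c('200011', '200012'),
                time = 'all',
                latitude = 'all',
                longitude = 'all',
                return_vars = list(latitude = 'dat',
                                   longitude = 'dat',
                                   time = 'sdate'),
                retrieve = FALSE)

  # 2. Declare the workflow (argument names assumed, see the note above).
  step <- Step(fun = time_mean, target_dims = 'time', output_dims = NULL)
  wf <- AddStep(inputs = data, step_fun = step)

  # 3 and 4. Define the computation settings and run locally,
  # splitting the data into 2 chunks along 'sdate'.
  res <- Compute(workflow = wf, chunks = list(sdate = 2))
}
```

Keeping the step function free of any startR calls makes it easy to test on a small in-memory array before running the full workflow.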
startR is not bound to a specific file format. Interface functions to custom file formats can be provided for Start() to read them. As of this version, startR includes interface functions to the following file formats:
- NetCDF
Metadata and auxiliary data are also preserved and arranged by Start() to the extent that they are retrieved by the interface functions for a specific file format.
Usage
Start(
  ...,
  return_vars = NULL,
  synonims = NULL,
  file_opener = NcOpener,
  file_var_reader = NcVarReader,
  file_dim_reader = NcDimReader,
  file_data_reader = NcDataReader,
  file_closer = NcCloser,
  transform = NULL,
  transform_params = NULL,
  transform_vars = NULL,
  transform_extra_cells = 2,
  apply_indices_after_transform = FALSE,
  pattern_dims = NULL,
  metadata_dims = NULL,
  selector_checker = SelectorChecker,
  merge_across_dims = FALSE,
  merge_across_dims_narm = TRUE,
  split_multiselected_dims = FALSE,
  path_glob_permissive = FALSE,
  largest_dims_length = FALSE,
  retrieve = FALSE,
  num_procs = 1,
  ObjectBigmemory = NULL,
  silent = FALSE,
  debug = FALSE
)
Arguments
... |
A selection of customized parameters depending on the data
format. When we retrieve data from one or a collection of data sets,
the involved data can be perceived as belonging to a large multi-dimensional
array. For instance, let us consider an example case. We want to retrieve data
from a source, which contains data for the number of monthly sales of various
items, and also for their retail price each month. The data on source is
stored as follows:
For each dimension, the 3 first information items can be specified with a set
of parameters to be provided through |
return_vars |
A named list where the names are the names of the
variables to be fetched in the files, and the values are vectors of
character strings with the names of the file dimensions for which to retrieve
each variable, or NULL if the variable has to be retrieved only once
from any (the first) of the involved files. |
synonims |
A named list where the names are the requested variable or
dimension names, and the values are vectors of character strings with
alternative names under which to look for that dimension or variable. |
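As a small illustration of the shape of these two parameters, the sketch below builds a 'return_vars' list (the one used in the Examples section) and a 'synonims' list. The alternative names 'lat' and 'lon' and the helper resolve_name() are hypothetical and not part of startR; the helper only mimics the idea of looking up a requested name under its alternatives.

```r
# Variables to fetch, each with the file dimension along which it is retrieved.
return_vars <- list(latitude = 'dat', longitude = 'dat', time = 'sdate')

# Alternative names to look for in the files (hypothetical names).
synonims <- list(latitude = c('lat', 'latitude'),
                 longitude = c('lon', 'longitude'))

# Hypothetical helper, not part of startR: return the first name among the
# requested one and its synonyms that is actually present in a file.
resolve_name <- function(requested, available, synonims) {
  candidates <- unique(c(requested, synonims[[requested]]))
  found <- candidates[candidates %in% available]
  if (length(found) == 0) NA_character_ else found[1]
}

# A file that calls its coordinates 'lat' and 'lon'.
resolved <- resolve_name('latitude', c('time', 'lat', 'lon'), synonims)
```

Here the requested name 'latitude' is matched to the file's 'lat'.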
file_opener |
A function that receives as a single parameter
'file_path' a character string with the path to a file to be opened,
and returns an object with an open connection to the file (optionally with
header information) on success, or returns NULL on failure.
|
file_var_reader |
A function with the header |
file_dim_reader |
A function with the header |
file_data_reader |
A function with the header |
file_closer |
A function that receives as a single parameter
'file_object' an open connection (as returned by 'file_opener')
to one of the files to be read, optionally with header information, and
closes the open connection. Always returns NULL.
|
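The contracts described for 'file_opener' and 'file_closer' can be sketched with a pair of functions for plain binary files. The names my_opener and my_closer are hypothetical and only follow the behaviour stated above: a connection on success, NULL on failure, and NULL after closing. The matching reader functions a real custom format would also need are not shown.

```r
# Hypothetical opener: takes 'file_path', returns an open connection,
# or NULL if the file cannot be opened.
my_opener <- function(file_path) {
  tryCatch(
    suppressWarnings(file(file_path, open = "rb")),
    error = function(e) NULL
  )
}

# Hypothetical closer: takes 'file_object' (as returned by the opener),
# closes the connection and always returns NULL.
my_closer <- function(file_object) {
  close(file_object)
  NULL
}

# Usage on a temporary file and on a path that does not exist.
tmp <- tempfile(fileext = ".bin")
writeBin(1:3, tmp)
con <- my_opener(tmp)
closed <- my_closer(con)
missing_con <- my_opener(file.path(tempdir(), "no_such_file.bin"))
```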
transform |
A function with the header |
transform_params |
A named list with additional parameters to be sent to the 'transform' function (if specified). See documentation on parameter 'transform' for details. |
transform_vars |
A vector of character strings with the names of auxiliary variables to be sent to the 'transform' function (if specified). All the variables to be sent to 'transform' must also have been requested as return variables in the parameter 'return_vars' of Start(). |
transform_extra_cells |
An integer number of extra indices to retrieve from the
data set, beyond the requested indices in |
apply_indices_after_transform |
A logical value. When a 'transform' is specified in Start() and numeric indices are provided for any of the inner dimensions that depend on coordinate variables, these numeric indices can be applied (i.e. the data retrieved) either before or after the transformation. This flag selects which of the two happens. The default value is FALSE (numeric indices are applied before sending data to 'transform'). |
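The difference between the two orders can be seen in a conceptual base R sketch. It does not reproduce startR internals: the values and the toy transform refine(), which interpolates onto a finer grid, are assumptions for illustration only.

```r
# Hypothetical values along an inner dimension.
values <- c(10, 20, 30, 40)

# Toy transform (not a startR transform function): refine the grid by
# linear interpolation, inserting one point between each pair of neighbours.
refine <- function(v) {
  approx(seq_along(v), v, n = 2 * length(v) - 1)$y
}

idx <- 1:2

# Indices applied before the transform: select on the original grid first.
before <- refine(values[idx])

# Indices applied after the transform: select on the transformed grid.
after <- refine(values)[idx]
```

In this sketch 'before' holds three interpolated values spanning the first two original points, while 'after' holds the first two points of the refined grid, so the same numeric indices select different data.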
pattern_dims |
A character string indicating the name of the dimension
with path pattern specifications (see |
metadata_dims |
A vector of character strings with the names of the file
dimensions for which to return metadata. As noted in 'file_data_reader',
the data reader can optionally return auxiliary data via the attribute
'variables' of the returned array. Start() by default returns the
auxiliary data read for only the first file of each source (or data set) in
the pattern dimension (see |
selector_checker |
A function used internally by Start() to translate a set of selectors (values for a dimension associated to a coordinate variable) into a set of numeric indices. It takes by default SelectorChecker() and, in principle, it should not be required to change it for customized file formats. The option to replace it is left open for more versatility. See the code of SelectorChecker() for details on the inputs, functioning and outputs of a selector checker. |
merge_across_dims |
A logical value indicating whether to merge
dimensions across which another dimension extends (according to the
'<dimname>_across' parameters). Takes the value FALSE by default. For
example, if the dimension 'time' extends across the dimension 'chunk' and
|
merge_across_dims_narm |
A logical value indicating whether to remove the additional NAs from the data when parameter 'merge_across_dims' is TRUE. It is helpful when the length of the to-be-merged dimension differs across another dimension. For example, suppose the dimension 'time' extends across the dimension 'chunk', and the time length is 2 along the first chunk and 10 along the second. Setting this parameter to TRUE removes the 8 additional NAs at positions 3 to 10. The default value is TRUE, but it is automatically set to FALSE if 'merge_across_dims = FALSE'. |
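The example in the description (time length 2 along the first chunk, 10 along the second) can be reproduced with a base R sketch. It only mimics the outcome of merging and NA removal on a hypothetical padded array, and does not show how Start() does it internally.

```r
# Hypothetical padded array: 'time' is padded with NA to the longest length.
valid_len <- c(2, 10)                 # valid time steps in each chunk
x <- array(NA_real_, dim = c(time = 10, chunk = 2))
x[1:2, 1] <- c(1, 2)                  # first chunk: only 2 time steps
x[, 2] <- 3:12                        # second chunk: 10 time steps

# Merge 'time' across 'chunk' into one dimension of length 20.
merged <- as.vector(x)

# Drop the padding by position, using the known valid lengths.
keep <- unlist(lapply(seq_along(valid_len), function(i) {
  (i - 1) * dim(x)[["time"]] + seq_len(valid_len[i])
}))
merged_narm <- merged[keep]
```

The padding sits at positions 3 to 10 of the merged vector, and the cleaned vector has the 12 valid values in order.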
split_multiselected_dims |
A logical value indicating whether to split a dimension that has been selected with a multidimensional array of selectors into as many dimensions as present in the selector array. The default value is FALSE. |
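A conceptual base R sketch of what splitting means in terms of shapes is given below. The selector array, its dimension names and the source values are hypothetical, and the sketch does not call Start().

```r
# Hypothetical selector array: for each of 2 start dates, 3 time indices.
selectors <- array(c(1, 4, 2, 5, 3, 6), dim = c(sdate = 2, time = 3))
source_values <- seq(100, 600, by = 100)   # hypothetical data along 'time'

# Without splitting: the selected data keeps a single dimension of length 6.
flat <- source_values[as.vector(selectors)]

# With splitting: the selected data takes the shape of the selector array,
# i.e. one dimension per dimension of the selectors.
split_result <- array(flat, dim = dim(selectors))
```

Each row of the split result then holds the values selected for one start date.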
path_glob_permissive |
A logical value or an integer specifying how many
folder levels in the path pattern, beginning from the end, the shell glob
expressions must be preserved and worked out for each file. The default
value is FALSE, which is equivalent to 0. TRUE is equivalent to 1. |
largest_dims_length |
A logical value or a named integer vector
indicating if Start() should examine all the files to get the largest
length of the inner dimensions (TRUE) or use the first valid file of each
dataset as the returned dimension length (FALSE). Since examining all the
files could be time-consuming, a vector can be used to explicitly specify
the expected length of the inner dimensions. For those inner dimensions not
specified, the first valid file will be used. The default value is FALSE. |
retrieve |
A logical value indicating whether to retrieve the data defined in the Start() call or to explore only its dimension lengths and names, and the values for the file and inner dimensions. The default value is FALSE. |
num_procs |
An integer number of processes to be created for the parallel execution of the retrieval/transformation/arrangement of the multiple involved files in a call to Start(). If set to NULL, takes the number of available cores (as detected by future::availableCores). The default value is 1 (no parallel execution). |
ObjectBigmemory |
A character string to be included as part of the bigmemory object name. This parameter is intended for internal use by the chunking capabilities of startR. |
silent |
A logical value of whether to display progress messages (FALSE) or not (TRUE). The default value is FALSE. |
debug |
A logical value of whether to return detailed messages on the progress and operations in a Start() call (TRUE) or not (FALSE). The default value is FALSE. |
Value
If retrieve = TRUE, the involved data is loaded into RAM
and an object of the class 'startR_cube' with the following components is
returned:
Data |
Multidimensional data array with named dimensions, with the data values
requested via |
Variables |
Named list of 1 + N components, containing lists of retrieved variables (as
requested in 'return_vars') common to all the data sources (in the 1st
component, |
Files |
Multidimensional character string array with named dimensions. Its dimensions
are the file dimensions (as requested in |
NotFoundFiles |
Array with the same shape as |
FileSelectors |
Multidimensional character string array with named dimensions, with the same
shape as |
If retrieve = FALSE, the involved data is not loaded into RAM and
an object of the class 'startR_header' with the following components is
returned:
Dimensions |
Named vector with the dimension lengths and names of the data involved in the Start() call. |
Variables |
Named list of 1 + N components, containing lists of retrieved variables (as
requested in 'return_vars') common to all the data sources (in the 1st
component, |
Files |
Multidimensional character string array with named dimensions. Its dimensions are the file dimensions (as requested in ...). Each cell in this array contains a path to a file to be retrieved (which may exist or not). |
FileSelectors |
Multidimensional character string array with named dimensions, with the same
shape as |
StartRCall |
List of parameters sent to the Start() call, with the parameter 'retrieve' set to TRUE. It is intended to be passed to do.call() in order to retrieve the associated data a posteriori. |
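The "explore first, retrieve later" pattern can be sketched in base R with a stand-in function. fake_start() is hypothetical and only imitates the idea of storing the call parameters with 'retrieve' switched to TRUE; how 'StartRCall' is accessed on a real 'startR_header' object is not shown on this page and should be checked against the installed package.

```r
# Hypothetical stand-in for Start(), used only to illustrate the pattern.
fake_start <- function(var, sdate, retrieve = FALSE) {
  if (retrieve) {
    # Pretend to load the data.
    paste("data for", var, sdate)
  } else {
    # Return a header that keeps the parameters, with 'retrieve' set to TRUE.
    list(StartRCall = list(var = var, sdate = sdate, retrieve = TRUE))
  }
}

# First call: explore only, nothing is loaded.
header <- fake_start(var = 'tos', sdate = '200011')

# Later: replay the stored parameters to retrieve the data.
data <- do.call(fake_start, header$StartRCall)
```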
Examples
data_path <- system.file('extdata', package = 'startR')
path_obs <- file.path(data_path, 'obs/monthly_mean/$var$/$var$_$sdate$.nc')
sdates <- c('200011', '200012')
data <- Start(dat = list(list(path = path_obs)),
              var = 'tos',
              sdate = sdates,
              time = 'all',
              latitude = 'all',
              longitude = 'all',
              return_vars = list(latitude = 'dat',
                                 longitude = 'dat',
                                 time = 'sdate'),
              retrieve = FALSE)