import_src {ricu}
Data import utilities
Description
Making a dataset available to ricu consists of three steps: downloading
(download_src()), importing (import_src()) and attaching (attach_src()).
While downloading and importing are one-time procedures, attaching of the
dataset is repeated every time the package is loaded. Briefly, downloading
retrieves the raw dataset from the internet (most likely in .csv format),
importing performs some preprocessing to make the data available more
efficiently, and attaching sets up the data for use by the package.
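As a sketch of the full cycle (assuming the small "mimic_demo" dataset and
a throwaway directory; only attach_src() recurs after the first run):

  dir <- tempdir()
  download_src("mimic_demo", dir)            # one-time download
  import_src("mimic_demo", dir)              # one-time conversion to .fst
  attach_src("mimic_demo", data_dir = dir)   # repeated at every package load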
Usage
import_src(x, ...)

## S3 method for class 'src_cfg'
import_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  verbose = TRUE,
  ...
)

## S3 method for class 'aumc_cfg'
import_src(x, ...)

## S3 method for class 'character'
import_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  verbose = TRUE,
  cleanup = FALSE,
  ...
)

import_tbl(x, ...)

## S3 method for class 'tbl_cfg'
import_tbl(
  x,
  data_dir = src_data_dir(x),
  progress = NULL,
  cleanup = FALSE,
  ...
)
Arguments
x: Object specifying the source configuration.

...: Passed to downstream methods (ultimately to readr::read_csv() or
readr::read_csv_chunked()), or used for generic consistency.

data_dir: Destination directory where the downloaded data is written to.

tables: Character vector specifying the tables to import. If NULL, all
available tables are imported.

force: Logical flag; if TRUE, the import is carried out even if previously
imported files already exist, overwriting them.

verbose: Logical flag indicating whether to print progress information.

cleanup: Logical flag indicating whether to remove raw .csv files after
conversion to .fst.

progress: Either NULL, or a progress bar object as created by
progress::progress_bar.
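For instance, a selective, forced re-import might look as follows (a
sketch, assuming the "mimic_demo" source has already been downloaded; the
table names are examples from that dataset):

  import_src("mimic_demo", tables = c("admissions", "patients"),
             force = TRUE, verbose = FALSE)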
Details
In order to speed up data access operations, ricu does not directly use
the PhysioNet-provided CSV files, but converts all data to fst::fst()
format, which allows for random row and column access. Large tables are
split into chunks in order to keep memory requirements reasonably low.
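The payoff of the fst conversion is that subsets can be read without
loading whole tables. A minimal sketch using fst directly (the file path
is an assumption for illustration; ricu normally resolves table locations
itself):

  fst::read_fst(
    file.path(src_data_dir("mimic_demo"), "admissions.fst"),
    columns = c("hadm_id", "admittime"),
    from = 1, to = 100
  )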
The one-time step per dataset of data import is fairly resource intensive.
Depending on CPU and available storage system, it will take on the order
of an hour to run to completion, and depending on the dataset, somewhere
between 50 GB and 75 GB of temporary disk space are required: tables are
uncompressed, rows are reordered in the case of partitioned data, and the
data is saved again to a storage-efficient format.
The S3 generic function import_src() performs import of an entire data
source, internally calling the S3 generic function import_tbl() in order
to perform import of individual tables. Method dispatch is intended to
occur on objects inheriting from src_cfg and tbl_cfg, respectively. Such
objects can be generated from JSON-based configuration files, which
contain information such as table names, column types or row numbers, in
order to provide safety in parsing of .csv files. For more information on
data source configuration, refer to load_src_cfg().
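Dispatch on src_cfg objects means an import can equally be run from a
parsed configuration (a sketch; load_src_cfg() returns a named list of
src_cfg objects, one per data source):

  cfg <- load_src_cfg("mimic_demo")[["mimic_demo"]]
  import_src(cfg)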
Current import capabilities include re-saving a .csv file to .fst at once
(used for smaller tables), reading a large .csv file using the
readr::read_csv_chunked() API while partitioning chunks and reassembling
sub-partitions (used for splitting a large file into partitions), as well
as re-partitioning an already partitioned table according to a new
partitioning scheme. Care has been taken to keep the maximal memory
requirements reasonably low, such that data import is feasible on
laptop-class hardware.
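In outline, the chunked conversion strategy corresponds to something like
the following standalone sketch (not ricu internals; the file name and
chunk size are made up for illustration):

  path <- "big_table.csv"  # hypothetical large input file
  part <- 0L
  readr::read_csv_chunked(
    path,
    callback = readr::SideEffectChunkCallback$new(function(chunk, pos) {
      part <<- part + 1L  # write each chunk as its own .fst partition
      fst::write_fst(chunk, sprintf("big_table-%d.fst", part))
    }),
    chunk_size = 1e6
  )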
Value
Called for side effects and returns NULL invisibly.
Examples
## Not run:
dir <- tempdir()
list.files(dir)
download_src("mimic_demo", dir)
list.files(dir)
import_src("mimic_demo", dir)
list.files(dir)
unlink(dir, recursive = TRUE)
## End(Not run)