import_src {ricu}
Data import utilities
Description
Making a dataset available to ricu consists of three steps: downloading
(download_src()), importing (import_src()) and attaching (attach_src()).
While downloading and importing are one-time procedures, attaching of the
dataset is repeated every time the package is loaded. Briefly, downloading
retrieves the raw dataset from the internet (most likely in .csv format),
importing performs some preprocessing to make the data available more
efficiently, and attaching sets up the data for use by the package.
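As a sketch of the full cycle (assuming the small "mimic_demo" dataset and
a throwaway directory; only attach_src() recurs after the first run):

  dir <- tempdir()
  download_src("mimic_demo", dir)            # one-time download
  import_src("mimic_demo", dir)              # one-time conversion to .fst
  attach_src("mimic_demo", data_dir = dir)   # repeated at every package load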
Usage
import_src(x, ...)

## S3 method for class 'src_cfg'
import_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  verbose = TRUE,
  ...
)

## S3 method for class 'aumc_cfg'
import_src(x, ...)

## S3 method for class 'character'
import_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  verbose = TRUE,
  cleanup = FALSE,
  ...
)

import_tbl(x, ...)

## S3 method for class 'tbl_cfg'
import_tbl(
  x,
  data_dir = src_data_dir(x),
  progress = NULL,
  cleanup = FALSE,
  ...
)
Arguments
x: Object specifying the source configuration.

...: Passed to downstream methods (ultimately to readr::read_csv() or
readr::read_csv_chunked()), or used for generic consistency.

data_dir: Destination directory where the downloaded data is written to.

tables: Character vector specifying the tables to import. If NULL, all
available tables are imported.

force: Logical flag; if TRUE, the import is carried out even if previously
imported files already exist, overwriting them.

verbose: Logical flag indicating whether to print progress information.

cleanup: Logical flag indicating whether to remove raw .csv files after
conversion to .fst.

progress: Either NULL, or a progress bar object as created by
progress::progress_bar.
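For instance, a selective, forced re-import might look as follows (a
sketch, assuming the "mimic_demo" source has already been downloaded; the
table names are examples from that dataset):

  import_src("mimic_demo", tables = c("admissions", "patients"),
             force = TRUE, verbose = FALSE)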
Details
In order to speed up data access operations, ricu does not directly use
the PhysioNet-provided CSV files, but converts all data to fst::fst()
format, which allows for random row and column access. Large tables are
split into chunks in order to keep memory requirements reasonably low.
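The payoff of the fst conversion is that subsets can be read without
loading whole tables. A minimal sketch using fst directly (the file path
is an assumption for illustration; ricu normally resolves table locations
itself):

  fst::read_fst(
    file.path(src_data_dir("mimic_demo"), "admissions.fst"),
    columns = c("hadm_id", "admittime"),
    from = 1, to = 100
  )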
The one-time step per dataset of data import is fairly resource intensive.
Depending on CPU and available storage system, it will take on the order
of an hour to run to completion, and depending on the dataset, somewhere
between 50 GB and 75 GB of temporary disk space are required: tables are
uncompressed, rows are reordered in the case of partitioned data, and the
data is saved again to a storage-efficient format.
The S3 generic function import_src() performs import of an entire data
source, internally calling the S3 generic function import_tbl() in order
to perform import of individual tables. Method dispatch is intended to
occur on objects inheriting from src_cfg and tbl_cfg, respectively. Such
objects can be generated from JSON-based configuration files, which
contain information such as table names, column types or row numbers, in
order to provide safety in parsing of .csv files. For more information on
data source configuration, refer to load_src_cfg().
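Dispatch on src_cfg objects means an import can equally be run from a
parsed configuration (a sketch; load_src_cfg() returns a named list of
src_cfg objects, one per data source):

  cfg <- load_src_cfg("mimic_demo")[["mimic_demo"]]
  import_src(cfg)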
Current import capabilities include re-saving a .csv file to .fst at once
(used for smaller tables), reading a large .csv file using the
readr::read_csv_chunked() API while partitioning chunks and reassembling
sub-partitions (used for splitting a large file into partitions), as well
as re-partitioning an already partitioned table according to a new
partitioning scheme. Care has been taken to keep the maximal memory
requirements reasonably low, such that data import is feasible on
laptop-class hardware.
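In outline, the chunked conversion strategy corresponds to something like
the following standalone sketch (not ricu internals; the file name and
chunk size are made up for illustration):

  path <- "big_table.csv"  # hypothetical large input file
  part <- 0L
  readr::read_csv_chunked(
    path,
    callback = readr::SideEffectChunkCallback$new(function(chunk, pos) {
      part <<- part + 1L  # write each chunk as its own .fst partition
      fst::write_fst(chunk, sprintf("big_table-%d.fst", part))
    }),
    chunk_size = 1e6
  )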
Value
Called for side effects and returns NULL invisibly.
Examples
## Not run:
dir <- tempdir()
list.files(dir)
download_src("mimic_demo", dir)
list.files(dir)
import_src("mimic_demo", dir)
list.files(dir)
unlink(dir, recursive = TRUE)
## End(Not run)