R: Unarchive a list of compressed tsv files into a database

unark {arkdb}

R Documentation

Unarchive a list of compressed tsv files into a database

Description

Unarchive a list of compressed tsv files into a database

Usage

unark(
  files,
  db_con,
  streamable_table = NULL,
  lines = 50000L,
  overwrite = "ask",
  encoding = Sys.getenv("encoding", "UTF-8"),
  tablenames = NULL,
  try_native = TRUE,
  ...
)

Arguments

`files`	vector of filenames to be read in. Files must currently be in `tsv` format, optionally compressed with `bzip2`, `gzip`, `zip`, or `xz`.
`db_con`	a database src (`src_dbi` object from `dplyr`); the Examples also pass a plain DBI connection.
`streamable_table`	interface for serializing/deserializing in chunks
`lines`	number of lines to read in a chunk.
`overwrite`	should any existing tables of the same name in the database be overwritten? Default is "ask", which will ask for confirmation in an interactive session, and overwrite in a non-interactive script. TRUE will always overwrite, FALSE will always skip such tables.
`encoding`	encoding to be assumed for input files.
`tablenames`	vector of tablenames to be used for corresponding files. By default, tables will be named using lowercase names from file basename with special characters replaced with underscores (for SQL compatibility).
`try_native`	logical, default TRUE. Should we try to use a native bulk import method for the database connection? This can substantially speed up read times and will fall back on the DBI method for any table that fails to import. Currently only MonetDBLite connections support this.
`...`	additional arguments to `streamable_table$read` method.
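
As a rough illustration of the `tablenames` and `overwrite` arguments, the sketch below archives a single small table and restores it under a different name. The directory name `unark_tablenames`, the `mtcars` demo table, and the target name `cars_copy` are assumptions made for this example only; it also assumes the `DBI` and `RSQLite` packages are installed.

```r
library(arkdb)

# Assumed setup for illustration: a scratch directory and a small SQLite
# database holding one table.
dir <- file.path(tempdir(), "unark_tablenames")
dir.create(dir, showWarnings = FALSE)
src <- DBI::dbConnect(RSQLite::SQLite(), file.path(dir, "src.sqlite"))
DBI::dbWriteTable(src, "mtcars", mtcars, overwrite = TRUE)

# database -> mtcars.tsv.bz2
ark(src, dir, overwrite = TRUE)
files <- list.files(dir, "bz2$", full.names = TRUE)

# Restore into a fresh database under an explicit table name.
# overwrite = TRUE avoids the interactive prompt.
dest <- DBI::dbConnect(RSQLite::SQLite())
unark(files, dest, tablenames = "cars_copy", overwrite = TRUE)

tables <- DBI::dbListTables(dest)

DBI::dbDisconnect(src)
DBI::dbDisconnect(dest)
```

`tablenames` is matched to `files` by position, so it should have one entry per file.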

Details

unark will read in files in chunks and write them into a database. This is essential for processing compressed tables that may be too large to read into memory before being written into a database. In general, increasing the lines parameter will result in a faster total transfer but requires more free memory for working with the larger chunks.
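
The effect of the lines parameter can be sketched with a deliberately tiny chunk size. The example below is only an illustration under stated assumptions: the `mtcars` demo table, the directory name `unark_chunks`, and the value `lines = 10L` are chosen for the example (a real import would normally use a much larger value), and `DBI` and `RSQLite` are assumed to be installed.

```r
library(arkdb)

# Assumed setup for illustration: archive a 32-row table to mtcars.tsv.bz2.
dir <- file.path(tempdir(), "unark_chunks")
dir.create(dir, showWarnings = FALSE)
src <- DBI::dbConnect(RSQLite::SQLite(), file.path(dir, "src.sqlite"))
DBI::dbWriteTable(src, "mtcars", mtcars, overwrite = TRUE)
ark(src, dir, overwrite = TRUE)
files <- list.files(dir, "bz2$", full.names = TRUE)

# Read the archive back 10 lines at a time, so the table is written to
# the database in several small chunks rather than in a single pass.
dest <- DBI::dbConnect(RSQLite::SQLite())
unark(files, dest, lines = 10L, overwrite = TRUE)

# Count the rows that arrived in the new database.
n_rows <- DBI::dbGetQuery(dest, "SELECT COUNT(*) AS n FROM mtcars")$n

DBI::dbDisconnect(src)
DBI::dbDisconnect(dest)
```

Smaller values of lines keep peak memory low at the cost of more read and write calls; larger values trade memory for speed.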

If using a readr-based streamable table, you can suppress the progress bar by setting options(readr.show_progress = FALSE) before reading in large files.
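
A minimal sketch of that setting, assuming the `readr`, `DBI`, and `RSQLite` packages are installed and using `streamable_readr_tsv()` from arkdb as the readr-based streamable table. The `mtcars` demo table and the directory name `unark_readr` are assumptions for this example.

```r
library(arkdb)

# Turn off the readr progress bar and remember the previous setting.
old_opts <- options(readr.show_progress = FALSE)

# Assumed setup for illustration: archive one small table with the
# readr-based tsv interface.
dir <- file.path(tempdir(), "unark_readr")
dir.create(dir, showWarnings = FALSE)
src <- DBI::dbConnect(RSQLite::SQLite(), file.path(dir, "src.sqlite"))
DBI::dbWriteTable(src, "mtcars", mtcars, overwrite = TRUE)
ark(src, dir, streamable_table = streamable_readr_tsv(), overwrite = TRUE)
files <- list.files(dir, "bz2$", full.names = TRUE)

# Restore with the same readr-based interface.
dest <- DBI::dbConnect(RSQLite::SQLite())
unark(files, dest, streamable_table = streamable_readr_tsv(), overwrite = TRUE)

restored <- DBI::dbExistsTable(dest, "mtcars")

# Restore the previous option and close the connections.
options(old_opts)
DBI::dbDisconnect(src)
DBI::dbDisconnect(dest)
```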

Value

the database connection (invisibly)

Examples


## Setup: create an archive.
library(arkdb)
library(dplyr)
dir <- tempdir()
db <- dbplyr::nycflights13_sqlite(dir)

## database -> .tsv.bz2
ark(db, dir)

## list all files in archive (full paths)
files <- list.files(dir, "bz2$", full.names = TRUE)

## Read archived files into a new database (another sqlite in this case)
new_db <- DBI::dbConnect(RSQLite::SQLite())
unark(files, new_db)

## Prove table is returned successfully.
tbl(new_db, "flights")

[Package arkdb version 0.0.18 Index]