ark {arkdb}    R Documentation
Archive tables from a database as flat files
Description
Archive tables from a database as flat files
Usage
ark(
  db_con,
  dir,
  streamable_table = streamable_base_tsv(),
  lines = 50000L,
  compress = c("bzip2", "gzip", "xz", "none"),
  tables = list_tables(db_con),
  method = c("keep-open", "window", "window-parallel", "sql-window"),
  overwrite = "ask",
  filter_statement = NULL,
  filenames = NULL,
  callback = NULL
)
Arguments
db_con
    a database connection
dir
    a directory where the compressed text files will be written
streamable_table
    interface for serializing/deserializing in chunks
lines
    the number of lines to use in each chunk
compress
    file compression algorithm. Should be one of "bzip2" (default), "gzip" (faster write times, a bit less compression), "xz", or "none" for no compression.
tables
    a list of tables from the database that should be archived. By default, all tables are archived. The table list should specify the schema if appropriate; see examples.
method
    the method used to query the database; see Details.
overwrite
    should any existing text files of the same name be overwritten? The default is "ask", which asks for confirmation in an interactive session and overwrites in a non-interactive script. TRUE will always overwrite, FALSE will always skip such tables.
filter_statement
    Typically an SQL "WHERE" clause, specific to your dataset.
filenames
    An optional vector of names that will be used to name the files instead of using the table name from the tables argument.
callback
    An optional function that acts on the data.frame before it is written to disk by streamable_table.
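For illustration, here is a minimal sketch (not one of the package's own examples) combining filter_statement, filenames, and callback, reusing the nycflights13 SQLite database that the Examples below set up; the particular filter, output name, and post-processing step are arbitrary choices:

library(arkdb)
db  <- dbplyr::nycflights13_sqlite(tempdir())
dir <- tempdir()
ark(db, dir,
    tables = "flights",
    filter_statement = "WHERE month = 1",  # archive only the January rows
    filenames = "flights_january",         # name the output file ourselves
    callback = function(d) {
      # post-process each chunk before it is written to disk
      d$origin <- toupper(d$origin)
      d
    })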
Details
ark will archive tables from a database as (compressed) tsv files, or in any other format that has a streamable_table method, such as parquet. ark does this by reading only chunks at a time into memory, allowing it to process tables that would be too large to read into memory all at once (which is probably why you are using a database in the first place!). Compressed text files will likely take up much less space, making them easier to store and transfer over networks. Compressed plain-text files are also more archive-friendly, as they rely on widely available, long-established open-source compression algorithms and plain text, making them less vulnerable to loss through changes in database technology and formats.
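As a quick illustration (a sketch reusing the db and dir objects created in the Examples below; these particular values are only examples), the chunk size and compression can be tuned to trade memory use and write speed against compression ratio:

ark(db, dir, lines = 10000L, compress = "gzip")  # smaller chunks, faster but lighter compression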
In almost all cases, the default method should be the best choice. If the DBI::dbSendQuery() implementation for your database platform returns the full results to the client immediately rather than supporting chunking via the n parameter, you may want to use the "window" method, which is the most generic. The "sql-window" method provides a faster alternative for databases such as PostgreSQL that support windowing natively (i.e. BETWEEN queries). Note that "window-parallel" only works with streamable_parquet.
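For example (a sketch, again using the db and dir objects from the Examples below; the parquet variant additionally assumes the arrow package is installed):

ark(db, dir, method = "window")  # generic fallback when chunked dbSendQuery() is unsupported
ark(db, dir,
    streamable_table = streamable_parquet(),
    method = "window-parallel")  # parallel windowing works only with parquet output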
Value
the path to dir
where output files are created (invisibly), for piping.
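For instance (a sketch, with db and dir as in the Examples below), the invisibly returned path can feed straight into further file handling:

out <- ark(db, dir)
list.files(out)  # inspect the archives that were written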
Examples
# setup
library(dplyr)
dir <- tempdir()
db <- dbplyr::nycflights13_sqlite(tempdir())
## And here we go:
ark(db, dir)
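## (A sketch beyond the original example:) in non-interactive scripts, set
## overwrite explicitly instead of relying on the interactive "ask" default
ark(db, dir, overwrite = TRUE)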
## Not run:
## For a Postgres DB with schema, we can append schema names first
## to each of the table names, like so:
schema_tables <- DBI::dbGetQuery(db, DBI::sqlInterpolate(db,
  "SELECT table_name FROM information_schema.tables
   WHERE table_schema = ?schema",
  schema = "schema_name"
))
ark(db, dir, tables = paste0("schema_name", ".", schema_tables$table_name))
## End(Not run)