etl_cleanup {etl}    R Documentation
ETL functions for working with medium sized data
Description
These generic functions provide a systematic approach for performing ETL (extract-transform-load) operations on medium sized data.
Usage
etl_cleanup(obj, ...)
## Default S3 method:
etl_cleanup(
obj,
delete_raw = FALSE,
delete_load = FALSE,
pattern = "\\.(csv|zip)$",
...
)
etl_create(obj, ...)
## Default S3 method:
etl_create(obj, ...)
etl_update(obj, ...)
## Default S3 method:
etl_update(obj, ...)
etl_extract(obj, ...)
## Default S3 method:
etl_extract(obj, ...)
## S3 method for class 'etl_mtcars'
etl_extract(obj, ...)
## S3 method for class 'etl_cities'
etl_extract(obj, ...)
etl_load(obj, ...)
## Default S3 method:
etl_load(obj, ...)
etl_transform(obj, ...)
## Default S3 method:
etl_transform(obj, ...)
## S3 method for class 'etl_cities'
etl_transform(obj, ...)
Arguments
obj
an etl object
...
arguments passed to methods
delete_raw
should files be deleted from the raw directory?
delete_load
should files be deleted from the load directory?
pattern
regular expression matching file names to be deleted. By default,
this matches filenames ending in .csv or .zip.
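For example, to remove only the downloaded zip archives while keeping the CSVs, a call might look like the following sketch (it assumes an etl object named cars, as created in the Examples):
cars <- etl("mtcars")
etl_cleanup(cars, delete_raw = TRUE, pattern = "\\.zip$")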
Details
The purposes of these functions are to download data from a particular data source from the Internet, process it, and load it into a SQL database server.
There are five primary functions:
- etl_init
Initialize the database schema.
- etl_extract
Download data from the Internet and store it locally in its raw form.
- etl_transform
Manipulate the raw data such that it can be loaded into a database table. Usually, this means converting the raw data to (a series of) CSV files, which are also stored locally.
- etl_load
Load the transformed data into the database.
- etl_cleanup
Perform housekeeping, such as deleting unnecessary raw data files.
Additionally, two convenience functions chain these operations together:
- etl_create
Run all five functions in succession. This is useful when you want to create the database from scratch.
- etl_update
Run the etl_extract, etl_transform, and etl_load functions in succession. This is useful when the database already exists, but you want to insert some new data (see the sketch below).
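For instance, a periodic refresh of an existing database might be sketched as follows (assuming an etl object named cars, as in the Examples):
cars <- etl("mtcars")
cars %>%
etl_update()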
Value
Each one of these functions returns an etl object, invisibly.
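Because the object is returned invisibly, the result of a pipeline can be captured and reused, for example (a minimal sketch):
cars <- cars %>% etl_cleanup(delete_raw = TRUE)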
See Also
etl_init
Examples
## Not run:
if (require(RPostgreSQL)) {
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")
cars <- etl("mtcars", db)
}
if (require(RMySQL) && mysqlHasDefault()) {
db <- src_mysql(dbname = "mtcars", user = "r-user",
host = "localhost", password = "mypass")
cars <- etl("mtcars", db)
}
## End(Not run)
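# with no database specified, etl() creates a local SQLite database by default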
cars <- etl("mtcars")
cars %>%
etl_extract() %>%
etl_transform() %>%
etl_load() %>%
etl_cleanup()
cars
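# query the loaded table with dplyr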
cars %>%
tbl(from = "mtcars") %>%
group_by(cyl) %>%
summarise(N = n(), mean_mpg = mean(mpg))
# do it all in one step, and peek at the SQL creation script
cars %>%
etl_create(echo = TRUE)
# specify a directory for the data
## Not run:
cars <- etl("mtcars", dir = "~/dumps/mtcars/")
str(cars)
## End(Not run)
cars <- etl("mtcars")
# Do it step-by-step
cars %>%
etl_extract() %>%
etl_transform() %>%
etl_load()
# Note the somewhat imprecise data types for the columns. These are the defaults.
tbl(cars, "mtcars")
# But you can also specify your own schema if you want
schema <- system.file("sql", "init.sqlite", package = "etl")
cars %>%
etl_init(schema) %>%
etl_load()