etl_cleanup {etl}    R Documentation
ETL functions for working with medium sized data
Description
These generic functions provide a systematic approach for performing ETL (extract-transform-load) operations on medium sized data.
Usage
etl_cleanup(obj, ...)
## Default S3 method:
etl_cleanup(
obj,
delete_raw = FALSE,
delete_load = FALSE,
pattern = "\\.(csv|zip)$",
...
)
etl_create(obj, ...)
## Default S3 method:
etl_create(obj, ...)
etl_update(obj, ...)
## Default S3 method:
etl_update(obj, ...)
etl_extract(obj, ...)
## Default S3 method:
etl_extract(obj, ...)
## S3 method for class 'etl_mtcars'
etl_extract(obj, ...)
## S3 method for class 'etl_cities'
etl_extract(obj, ...)
etl_load(obj, ...)
## Default S3 method:
etl_load(obj, ...)
etl_transform(obj, ...)
## Default S3 method:
etl_transform(obj, ...)
## S3 method for class 'etl_cities'
etl_transform(obj, ...)
Arguments
obj
an etl object
...
arguments passed to methods
delete_raw
should files be deleted from the raw directory?
delete_load
should files be deleted from the load directory?
pattern
regular expression matching file names to be deleted. By default,
this matches filenames ending in .csv or .zip.
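For example, to remove only the downloaded zip archives while keeping the CSVs, a call might look like the following sketch (it assumes an etl object named cars, as created in the Examples):
cars <- etl("mtcars")
etl_cleanup(cars, delete_raw = TRUE, pattern = "\\.zip$")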
Details
The purposes of these functions are to download data from a particular data source from the Internet, process it, and load it into a SQL database server.
There are five primary functions:
- etl_init
Initialize the database schema.
- etl_extract
Download data from the Internet and store it locally in its raw form.
- etl_transform
Manipulate the raw data such that it can be loaded into a database table. Usually, this means converting the raw data to (a series of) CSV files, which are also stored locally.
- etl_load
Load the transformed data into the database.
- etl_cleanup
Perform housekeeping, such as deleting unnecessary raw data files.
Additionally, two convenience functions chain these operations together:
- etl_create
Run all five functions in succession. This is useful when you want to create the database from scratch.
- etl_update
Run the etl_extract, etl_transform, and etl_load functions in succession. This is useful when the database already exists, but you want to insert some new data (see the sketch below).
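For instance, a periodic refresh of an existing database might be sketched as follows (assuming an etl object named cars, as in the Examples):
cars <- etl("mtcars")
cars %>%
etl_update()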
Value
Each one of these functions returns an etl object, invisibly.
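Because the object is returned invisibly, the result of a pipeline can be captured and reused, for example (a minimal sketch):
cars <- cars %>% etl_cleanup(delete_raw = TRUE)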
See Also
etl_init
Examples
## Not run:
if (require(RPostgreSQL)) {
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")
cars <- etl("mtcars", db)
}
if (require(RMySQL) && mysqlHasDefault()) {
db <- src_mysql(dbname = "mtcars", user = "r-user",
host = "localhost", password = "mypass")
cars <- etl("mtcars", db)
}
## End(Not run)
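# with no database specified, etl() creates a local SQLite database by default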
cars <- etl("mtcars")
cars %>%
etl_extract() %>%
etl_transform() %>%
etl_load() %>%
etl_cleanup()
cars
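# query the loaded table with dplyr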
cars %>%
tbl(from = "mtcars") %>%
group_by(cyl) %>%
summarise(N = n(), mean_mpg = mean(mpg))
# do it all in one step, and peek at the SQL creation script
cars %>%
etl_create(echo = TRUE)
# specify a directory for the data
## Not run:
cars <- etl("mtcars", dir = "~/dumps/mtcars/")
str(cars)
## End(Not run)
cars <- etl("mtcars")
# Do it step-by-step
cars %>%
etl_extract() %>%
etl_transform() %>%
etl_load()
# Note the somewhat imprecise data types for the columns. These are the defaults.
tbl(cars, "mtcars")
# But you can also specify your own schema if you want
schema <- system.file("sql", "init.sqlite", package = "etl")
cars %>%
etl_init(schema) %>%
etl_load()