R: Preparation pipeline

prepare_set {dataPreparation}

R Documentation

Preparation pipeline

Description

Full pipeline for preparing your data set.

Usage

prepare_set(data_set, final_form = "data.table", verbose = TRUE, ...)

Arguments

`data_set`	Matrix, data.frame or data.table
`final_form`	"data.table" or "numerical_matrix" (default to data.table)
`verbose`	Should the algorithm talk? (logical, default to TRUE)
`...`	Additional parameters to tune pipeline (see details)

Details

Additional arguments are available to tune the pipeline:

`key`	Name of the column of data_set by which data_set should be aggregated (character)
`analysis_date`	Reference date for the analysis; differences between every date and analysis_date will be computed (Date)
`n_unfactor`	Maximum number of distinct values allowed in a factor; set it to -1 to disable the un_factor function (numeric, default to 53)
`digits`	Number of digits after the decimal point (optional, numeric; if set, fast_round is performed)
`dateFormats`	List of the date formats used in data_set (list of characters)
`name_separator`	Character used to separate the parts of new column names (character, default to ".")
`functions`	Aggregation functions for numeric columns, see aggregate_by_key (list of function names (character))
`factor_date_type`	Aggregation level used to factorize dates, see generate_factor_from_date (character, default to "yearmonth")
`target_col`	Target column used to perform target encoding, see target_encode (character)
`target_encoding_functions`	Functions used to perform target encoding, see build_target_encoding; ignored if target_col is not given (list, default to "mean")
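
To make the target_col and target_encoding_functions arguments more concrete, here is a minimal base R sketch of mean target encoding. It is an illustration of the general technique only, not the package's implementation: the toy data, the column names and the name of the generated column are all hypothetical.

```r
# Toy data: one categorical column and a numeric target (both hypothetical).
toy <- data.frame(
  country = c("FR", "FR", "US", "US", "US"),
  income  = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Mean of the target for each level of the categorical column.
level_means <- tapply(toy$income, toy$country, mean)

# Map every row to the mean of its level.
# The new column name is illustrative; the package may name it differently.
toy$income_mean_by_country <- as.numeric(level_means[toy$country])

print(toy)
```

Any other function in target_encoding_functions would presumably play the role that mean plays here: one summary value per level, copied back onto every row of that level.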

Value

A data.table or a numerical matrix (according to final_form).
The pipeline performs the following steps:

Correct set: unfactor factors with many values, identify dates and numerics that are hidden in character columns
Transform set: compute differences between every date, transform dates into factors, generate features from character columns; if key is provided, aggregate according to this key
Filter set: filter out constant, duplicated or bijection variables. If 'digits' is provided, numeric columns are rounded
Handle NA: perform fast_handle_na
Shape set: put the result in the requested shape (final_form) with acceptable column formats.
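
As a rough illustration of the "Filter set" step, the base R sketch below drops constant and duplicated columns and then rounds the numeric ones. It is a simplified stand-in under stated assumptions, not the code prepare_set runs: the toy data is hypothetical, and bijection detection is left out.

```r
# Toy data with one constant column and one duplicated column (hypothetical).
toy <- data.frame(
  id    = 1:4,
  const = rep("a", 4),
  x     = c(1.234, 2.345, 3.456, 4.567),
  x_dup = c(1.234, 2.345, 3.456, 4.567),
  stringsAsFactors = FALSE
)

# Drop constant columns (a single distinct value).
is_constant <- vapply(toy, function(col) length(unique(col)) <= 1, logical(1))
toy <- toy[, !is_constant, drop = FALSE]

# Drop columns whose content duplicates an earlier column.
toy <- toy[, !duplicated(as.list(toy)), drop = FALSE]

# Round the remaining numeric columns, mimicking the effect of 'digits'.
digits <- 1
is_num <- vapply(toy, is.numeric, logical(1))
toy[is_num] <- lapply(toy[is_num], round, digits = digits)

print(toy)
```

Only the id and x columns remain after these three operations, with x rounded to one decimal.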

Examples

# Load ugly set
## Not run: 
data(tiny_messy_adult)

# Have a look at the set
head(tiny_messy_adult)

# Compute full pipeline
clean_adult <- prepare_set(tiny_messy_adult)

# With a reference date
adult_agg <- prepare_set(tiny_messy_adult, analysis_date = as.Date("2017-01-01"))

# Add aggregation by country
adult_agg <- prepare_set(tiny_messy_adult, analysis_date = as.Date("2017-01-01"), key = "country")

# With some new aggregation functions
power <- function(x) {sum(x^2)}
adult_agg <- prepare_set(tiny_messy_adult, analysis_date = as.Date("2017-01-01"), key = "country",
                        functions = c("min", "max", "mean", "power"))

## End(Not run)
# "## Not run:" means that this example is not run on CRAN because it takes a long time. You can still run it yourself!

[Package dataPreparation version 1.1.1 Index]