prepare_set {dataPreparation} | R Documentation |
Preparation pipeline
Description
Full pipeline for preparing your data set.
Usage
prepare_set(data_set, final_form = "data.table", verbose = TRUE, ...)
Arguments
data_set | Matrix, data.frame or data.table
final_form | "data.table" or "numerical_matrix" (default to "data.table")
verbose | Should the algorithm talk? (logical, default to TRUE)
... | Additional parameters to tune pipeline (see Details)
Details
Additional arguments are available to tune the pipeline:
- key: Name of the column of data_set by which data_set should be aggregated (character)
- analysis_date: A date at which data_set should be aggregated; differences between every date and analysis_date will be computed (Date)
- n_unfactor: Maximum number of values in a factor; set it to -1 to disable the un_factor function (numeric, default to 53)
- digits: The number of digits after the decimal point (optional, numeric; if set, fast_round will be performed)
- dateFormats: List of the date formats used in data_set (list of characters)
- name_separator: Character used to separate the parts of new column names (character, default to ".")
- functions: Aggregation functions for numeric columns, see aggregate_by_key (list of function names (character))
- factor_date_type: Aggregation level used to factorize dates, see generate_factor_from_date (character, default to "yearmonth")
- target_col: A target column on which to perform target encoding, see target_encode (character)
- target_encoding_functions: Functions used to perform target encoding, see build_target_encoding; if target_col is not given, nothing is done (list, default to "mean")
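As an illustration of how several of these tuning arguments might be combined in one call, here is a minimal sketch. It only uses argument names listed above and the tiny_messy_adult data set from the Examples section; the particular values (digits = 2, name_separator = "_") are arbitrary choices for the illustration, and the call is guarded so that it is skipped when the package is not installed.

```r
# Sketch only: argument values are arbitrary, chosen for illustration.
if (requireNamespace("dataPreparation", quietly = TRUE)) {
  data("tiny_messy_adult", package = "dataPreparation")

  clean_matrix <- dataPreparation::prepare_set(
    tiny_messy_adult,
    final_form     = "numerical_matrix", # return a numerical matrix instead of a data.table
    verbose        = FALSE,              # keep the pipeline quiet
    n_unfactor     = -1,                 # disable un_factor
    digits         = 2,                  # round numerics via fast_round
    name_separator = "_"                 # separator used in generated column names
  )

  # Inspect the shape of the result
  dim(clean_matrix)
}
```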
Value
A data.table or a numerical matrix (according to final_form).
It will perform the following steps:
- Correct set: unfactor factors with many values, identify dates and numerics that are hidden in character columns
- Transform set: compute differences between every date, transform dates into factors, generate features from character columns...; if key is provided, aggregate according to this key
- Filter set: filter constant, duplicated ("in double") or bijection variables. If digits is provided, round numeric columns
- Handle NA: perform fast_handle_na
- Shape set: put the result in the requested shape (final_form) with acceptable column formats.
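To make the first three steps more concrete, here is a rough base R sketch of the ideas on a toy table. It is not the package's implementation: the toy column names, the threshold of 3, and the sign of the date difference (analysis date minus column date, in days) are all assumptions made for the illustration.

```r
# Toy data; every column name here is hypothetical.
toy <- data.frame(
  id       = factor(c("a", "b", "c", "d")),  # factor with 4 distinct values
  joined   = c("2016-01-01", "2016-06-15", "2016-12-31", "2015-01-01"),
  constant = rep(1, 4),
  stringsAsFactors = FALSE
)

# Correct set: turn factors with more than n_unfactor levels back into
# character, and parse dates hidden in a character column.
n_unfactor <- 3
too_many <- vapply(
  toy,
  function(col) is.factor(col) && nlevels(col) > n_unfactor,
  logical(1)
)
toy[too_many] <- lapply(toy[too_many], as.character)
toy$joined <- as.Date(toy$joined, format = "%Y-%m-%d")

# Transform set: difference between the analysis date and each date, in days
# (sign convention assumed for this sketch).
analysis_date <- as.Date("2017-01-01")
toy$joined_diff_days <- as.numeric(
  difftime(analysis_date, toy$joined, units = "days")
)

# Filter set: drop columns that hold a single distinct value.
is_constant <- vapply(toy, function(col) length(unique(col)) <= 1, logical(1))
toy <- toy[, !is_constant, drop = FALSE]

str(toy)
```

In the package itself these steps are handled by dedicated functions (for instance un_factor and fast_round, named above), so this sketch is only meant to show what each step is trying to achieve.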
Examples
# Load ugly set
## Not run:
data(tiny_messy_adult)
# Have a look at the set
head(tiny_messy_adult)
# Compute full pipeline
clean_adult <- prepare_set(tiny_messy_adult)
# With a reference date
adult_agg <- prepare_set(tiny_messy_adult, analysis_date = as.Date("2017-01-01"))
# Add aggregation by country
adult_agg <- prepare_set(tiny_messy_adult, analysis_date = as.Date("2017-01-01"), key = "country")
# With some new aggregation functions
power <- function(x) {sum(x^2)}
adult_agg <- prepare_set(tiny_messy_adult, analysis_date = as.Date("2017-01-01"), key = "country",
                         functions = c("min", "max", "mean", "power"))
## End(Not run)
# "## Not run:" means that this example is not run on CRAN because it takes too long. But you can run it!