R: Automate Data Preprocess for Modeling

model_preprocess {lares}

R Documentation

Automate Data Preprocess for Modeling

Description

Pre-process your data before training a model. This is the step that runs before training in the h2o_automl() function's pipeline. It is exposed on its own for other use cases, such as when you want to use another framework, library, or custom algorithm.

Usage

model_preprocess(
  df,
  y = "tag",
  ignore = NULL,
  train_test = NA,
  split = 0.7,
  weight = NULL,
  target = "auto",
  balance = FALSE,
  impute = FALSE,
  no_outliers = TRUE,
  unique_train = TRUE,
  center = FALSE,
  scale = FALSE,
  thresh = 10,
  seed = 0,
  quiet = FALSE
)

Arguments

`df`	Dataframe. Dataframe containing all your data, including the dependent variable labeled as `'tag'`. If you want to define which variable should be used instead, use the `y` parameter.
`y`	Character. Column name for dependent variable or response.
`ignore`	Character vector. Columns the model should be forced to ignore.
`train_test`	Character. If needed, name of the column in `df` containing 'test' and 'train' values used to split the data.
`split`	Numeric. Value between 0 and 1 used to split the data into train/test datasets. The value is the proportion assigned to the training set. Set value to 1 to train with all available data and test with the same data (cross-validation will still be used when training). If `train_test` is set, the value will be overwritten with its real split rate.
`weight`	Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
`target`	Value. Which is your target positive value? If set to `'auto'`, the target with largest `mean(score)` will be selected. Change the value to overwrite. Only used for binary categorical models.
`balance`	Boolean. Auto-balance train dataset with under-sampling?
`impute`	Boolean. Fill `NA` values with MICE?
`no_outliers`	Boolean/Numeric. Remove `y`'s outliers from the dataset? Will remove those values that are farther than n standard deviations from the dependent variable's mean (Z-score). Set to `TRUE` for default (3) or numeric to set a different multiplier.
`unique_train`	Boolean. Keep only unique row observations for training data?
`center`, `scale`	Boolean. Do you wish to center and/or scale all numerical values, using the base function `scale()`?
`thresh`	Integer. Threshold for selecting binary or regression models: this is the number of unique values in the dependent variable (`'tag'` or `y`) that separates the two (more than `thresh`: regression; fewer than `thresh`: classification).
`seed`	Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited.
`quiet`	Boolean. Quiet all messages, warnings, recommendations?
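
Several of the arguments above describe simple rules (`no_outliers`, `thresh`, `split`, `seed`). The base R sketch below illustrates those rules as this page states them. It is not the package's implementation: the helper names `drop_y_outliers()`, `guess_model_type()`, and `split_index()` are hypothetical, and the handling of values exactly at the threshold, the numeric check, and the returned labels are assumptions.

```r
# Illustrative base R helpers; hypothetical names, not part of lares.

# no_outliers: keep rows whose y lies within n standard deviations
# of mean(y) (Z-score rule). Rows with a missing y are kept here (assumption).
drop_y_outliers <- function(df, y, n = 3) {
  z <- (df[[y]] - mean(df[[y]], na.rm = TRUE)) / sd(df[[y]], na.rm = TRUE)
  df[is.na(z) | abs(z) <= n, , drop = FALSE]
}

# thresh: more unique values than thresh suggests regression,
# otherwise classification. Requiring a numeric y and treating a tie
# as classification are assumptions made for this sketch.
guess_model_type <- function(values, thresh = 10) {
  if (is.numeric(values) && length(unique(values)) > thresh) {
    "Regression"
  } else {
    "Classification"
  }
}

# split + seed: reproducible index of the rows assigned to the training set.
split_index <- function(n_rows, split = 0.7, seed = 0) {
  set.seed(seed)
  sort(sample(seq_len(n_rows), size = floor(split * n_rows)))
}

# Toy data: nineteen ordinary values and one extreme value of y.
toy <- data.frame(y = c(rep(10, 19), 1000), x = 1:20)
clean <- drop_y_outliers(toy, "y")  # drops the row with y = 1000
idx <- split_index(nrow(clean), split = 0.7, seed = 0)
type <- guess_model_type(clean$y, thresh = 10)
```

Calling `split_index()` twice with the same `seed` returns the same rows, which is the kind of reproducibility the `seed` argument is meant to provide for the split.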

Value

List. Contains the original data.frame `df`, an index `train_index` identifying which observations will be part of the train dataset, and `model_type`, the type of model that should be used.
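
To show how such a list could feed another framework, here is a minimal sketch in base R. It builds a mock list with the element names given on this page rather than calling `model_preprocess()`, so it runs without the package. Treating `train_index` as integer row positions is an assumption, as is the `"Regression"` label; inspect `names()` and `str()` of the real output on your installed version before relying on either.

```r
# Mock of the returned list, using the element names given on this page.
# With lares installed, this object would come from model_preprocess().
set.seed(0)
pre <- list(
  df = mtcars,
  # Assumed to hold row positions of the training observations.
  train_index = sort(sample(seq_len(nrow(mtcars)), size = 22)),
  model_type = "Regression"
)

# Split the data with the index.
train <- pre$df[pre$train_index, ]
test <- pre$df[-pre$train_index, ]

# Fit any model outside the h2o_automl() pipeline; here a plain linear model.
if (pre$model_type == "Regression") {
  fit <- lm(mpg ~ wt + hp, data = train)
  pred <- predict(fit, newdata = test)
  rmse <- sqrt(mean((test$mpg - pred)^2))  # hold-out error on the test rows
}
```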

Examples

data(dft) # Titanic dataset

model_preprocess(dft, "Survived", balance = TRUE)

model_preprocess(dft, "Fare", split = 0.5, scale = TRUE)

model_preprocess(dft, "Pclass", ignore = c("Fare", "Cabin"))

model_preprocess(dft, "Pclass", quiet = TRUE)