model_preprocess {lares} | R Documentation |
Automate Data Preprocess for Modeling
Description
Pre-process your data before training a model. This is the step that
precedes training in the h2o_automl() function's pipeline. It is also
available on its own for use cases where you want to use any other
framework, library, or custom algorithm.
Usage
model_preprocess(
df,
y = "tag",
ignore = NULL,
train_test = NA,
split = 0.7,
weight = NULL,
target = "auto",
balance = FALSE,
impute = FALSE,
no_outliers = TRUE,
unique_train = TRUE,
center = FALSE,
scale = FALSE,
thresh = 10,
seed = 0,
quiet = FALSE
)
Arguments
df |
Dataframe. Dataframe containing all your data, including
the dependent variable labeled as |
y |
Character. Column name for dependent variable or response. |
ignore |
Character vector. Columns to force the model to ignore. |
train_test |
Character. If needed, |
split |
Numeric. Value between 0 and 1 to split as train/test
datasets. Value is for training set. Set value to 1 to train with all
available data and test with same data (cross-validation will still be
used when training). If |
weight |
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. |
target |
Value. Which is your target positive value? If
set to |
balance |
Boolean. Auto-balance train dataset with under-sampling? |
impute |
Boolean. Fill |
no_outliers |
Boolean/Numeric. Remove |
unique_train |
Boolean. Keep only unique row observations for training data? |
center , scale |
Boolean. Using the base function scale, do you wish to center and/or scale all numerical values? |
thresh |
Integer. Threshold for selecting binary or regression
models: this number is the threshold of unique values we should
have in |
seed |
Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited. |
quiet |
Boolean. Quiet all messages, warnings, recommendations? |
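The split, seed, thresh, and weight arguments can be pictured with a few lines of base R. This is only a sketch of the ideas described above, not the code that lares runs: the helpers split_index() and guess_model_type() are hypothetical, the "<=" comparison used for thresh is an assumption, and the built-in mtcars dataset stands in for your own data.

```r
# Hypothetical helper: sample a fraction `split` of the row numbers for training.
split_index <- function(n, split = 0.7, seed = 0) {
  set.seed(seed)  # the same seed gives the same partition
  sort(sample.int(n, size = floor(n * split)))
}

# Hypothetical helper: few distinct values suggest classification,
# many distinct numeric values suggest regression.
# The "<=" comparison is an assumption, not taken from lares.
guess_model_type <- function(y, thresh = 10) {
  if (!is.numeric(y) || length(unique(y)) <= thresh) {
    "Classification"
  } else {
    "Regression"
  }
}

# split / seed: a reproducible 70/30 partition
train_idx <- split_index(nrow(mtcars), split = 0.7, seed = 0)
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# thresh: compare the number of distinct values against the threshold
type_cyl <- guess_model_type(mtcars$cyl)  # cyl has 3 distinct values
type_mpg <- guess_model_type(mtcars$mpg)  # mpg has more than 10 distinct values

# weight: a weight of 2 acts like repeating the row, a weight of 0 like dropping it
w <- rep(1, nrow(mtcars))
w[1] <- 2
w[2] <- 0
fit_weighted <- lm(mpg ~ wt, data = mtcars, weights = w)
fit_repeated <- lm(mpg ~ wt, data = mtcars[c(1, 1, 3:nrow(mtcars)), ])
same_fit <- all.equal(coef(fit_weighted), coef(fit_repeated))
```

The weight comparison only covers the fitted coefficients of a linear model; standard errors and other summaries are not expected to match between the weighted and the repeated-row fit.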
Value
List. Contains the original data.frame (df), an index identifying which
observations will be part of the train dataset (train_index), and which
model type should be used (model_type).
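A quick way to see what the returned list holds is to inspect it directly. The sketch below assumes the lares package is installed and uses the dft Titanic dataset from the Examples; it only relies on the element names train_index and model_type mentioned in this section, and it makes no assumption about the format of the index.

```r
# Runs only when the lares package is available.
if (requireNamespace("lares", quietly = TRUE)) {
  data("dft", package = "lares")  # Titanic dataset shipped with lares

  prep <- lares::model_preprocess(
    dft, "Survived",
    split = 0.7, seed = 0, quiet = TRUE
  )

  str(prep, max.level = 1)         # list the elements that were returned
  print(prep$model_type)           # model type chosen for the response
  print(length(prep$train_index))  # size of the train index
}
```

Because seed is fixed, repeating the call with the same arguments should select the same training rows.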
See Also
Other Machine Learning:
ROC(), conf_mat(), export_results(), gain_lift(), h2o_automl(), h2o_predict_MOJO(), h2o_selectmodel(), impute(), iter_seeds(), lasso_vars(), model_metrics(), msplit()
Examples
data(dft) # Titanic dataset
model_preprocess(dft, "Survived", balance = TRUE)
model_preprocess(dft, "Fare", split = 0.5, scale = TRUE)
model_preprocess(dft, "Pclass", ignore = c("Fare", "Cabin"))
model_preprocess(dft, "Pclass", quiet = TRUE)