ml_prepare_dataset {pysparklyr} | R Documentation |
Creates the 'label' and 'features' columns
Description
Creates the 'label' and 'features' columns
Usage
ml_prepare_dataset(
  x,
  formula = NULL,
  label = NULL,
  features = NULL,
  label_col = "label",
  features_col = "features",
  keep_original = TRUE,
  ...
)
Arguments
x |
A |
formula |
Used when |
label |
The name of the label column. |
features |
The name(s) of the feature columns as a character vector. |
label_col |
Label column name, as a length-one character vector. |
features_col |
Features column name, as a length-one character vector. |
keep_original |
Boolean flag that indicates whether the output will contain
the original columns from |
... |
Added for backwards compatibility. Not in use today. |
Details
At this time, 'Spark ML Connect' does not include a Vector Assembler transformer. The main thing this function does is create a 'PySpark' array column. Pipelines require a 'label' column and a 'features' column. Even though it is a single column in the dataset, the 'features' column contains all of the predictors inside an array. This function also creates a new 'label' column that copies the outcome variable. This makes it a lot easier to remove the 'label' and 'outcome' columns.
Value
A tbl_pyspark with either the original columns from x plus the 'label' and 'features' columns, or the 'label' and 'features' columns only.
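Examples
The call below is a minimal sketch of how the arguments listed under Usage might be combined; it is not taken from the package's own examples. It assumes a Spark Connect server is reachable at the placeholder address `sc://localhost`, that the Spark version shown matches that server, and that the built-in `mtcars` data is an acceptable stand-in for a real dataset. The choice of `mpg` as the outcome and `wt` and `hp` as predictors is purely illustrative.

```r
library(sparklyr)
library(pysparklyr)

# Assumption: a Spark Connect server is running at this address.
# The master URL and version are placeholders, adjust them to your setup.
sc <- spark_connect(
  master = "sc://localhost",
  method = "spark_connect",
  version = "3.5"
)

# Copy a small local data frame to Spark so there is a table to prepare
tbl_mtcars <- copy_to(sc, mtcars, overwrite = TRUE)

# Copy 'mpg' into the 'label' column and collect 'wt' and 'hp' into the
# 'features' array column; keep_original = FALSE drops the source columns
prepared <- ml_prepare_dataset(
  tbl_mtcars,
  label = "mpg",
  features = c("wt", "hp"),
  keep_original = FALSE
)

# Inspect which columns the prepared table contains
colnames(prepared)

spark_disconnect(sc)
```

With `keep_original = TRUE` (the default), the Value section indicates that the original columns would be returned alongside 'label' and 'features'.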