ml_prepare_dataset {pysparklyr}    R Documentation

Creates the 'label' and 'features' columns

Description

Creates the 'label' and 'features' columns that pipelines require.

Usage

ml_prepare_dataset(
  x,
  formula = NULL,
  label = NULL,
  features = NULL,
  label_col = "label",
  features_col = "features",
  keep_original = TRUE,
  ...
)

Arguments

x

A tbl_pyspark object

formula

An R formula. Used when x is a tbl_spark.

label

The name of the label column.

features

The name(s) of the feature columns as a character vector.

label_col

Label column name, as a length-one character vector.

features_col

Features column name, as a length-one character vector.

keep_original

Logical flag indicating whether the output keeps the original columns from x. Defaults to TRUE.

...

Added for backwards compatibility. Not in use today.

Details

At this time, 'Spark ML Connect' does not include a Vector Assembler transformer. The main thing this function does is create a 'Pyspark' array column. Pipelines require a 'label' column and a 'features' column. Even though it is a single column in the dataset, the 'features' column contains all of the predictors inside an array. This function also creates a new 'label' column that copies the outcome variable. This makes it a lot easier to remove the 'label' and 'outcome' columns.

Value

A tbl_pyspark containing either the original columns from x plus the 'label' and 'features' columns, or only the 'label' and 'features' columns.
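
Examples

The following is a minimal sketch of how the arguments in Usage might be combined; it is not taken from the package itself. It assumes a Spark Connect server is reachable at the placeholder address sc://localhost, that the server runs Spark 3.5, and that the Python environment pysparklyr depends on is already set up. The mtcars columns are chosen only for illustration.

```r
library(sparklyr)
library(pysparklyr)

# Assumption: placeholder address and version; adjust to your environment
sc <- spark_connect(
  master = "sc://localhost",
  method = "spark_connect",
  version = "3.5"
)

# Copy a small local data frame to Spark
tbl_mtcars <- copy_to(sc, mtcars)

# Formula interface: 'am' is the outcome, 'mpg' and 'wt' are the predictors.
# With keep_original = TRUE (the default) the original columns are retained.
prepared <- ml_prepare_dataset(tbl_mtcars, formula = am ~ mpg + wt)

# Naming the label and features directly, and returning only the
# 'label' and 'features' columns.
prepared_only <- ml_prepare_dataset(
  tbl_mtcars,
  label = "am",
  features = c("mpg", "wt"),
  keep_original = FALSE
)

spark_disconnect(sc)
```

Changing label_col or features_col renames the two generated columns, which may help when x already has a column called 'label' or 'features'.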


[Package pysparklyr version 0.1.4 Index]