ph_prep {pheble}

R Documentation

Preprocessing for phenotype classification via ensemble learning.

Description

The ph_prep function splits a data frame into training, validation, and test sets while ensuring that every class is represented in each set. By default, it performs a Principal Component Analysis (PCA) on the training set and projects the validation and test data into that space. If a non-linear dimensionality reduction strategy is preferred, an autoencoder can be used to extract deep features instead. Note that the parameters max_mem_size, hyper_params, search, and tune_length only apply when an autoencoder, method = "ae", is used. In this case, lists or vectors of candidate values for hyperparameters such as activation, hidden, input_dropout_ratio, and rate can be supplied through hyper_params (see parameter details) to search for the optimal hyperparameter combination. The autoencoder with the lowest reconstruction error is selected as the best model.
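
The default workflow described above (a class-stratified split, then a PCA fitted on the training set only and applied to the other two sets) can be sketched in base R. This is an illustration under stated assumptions, not pheble's internal code: the helper names `strat_sample` and `project` are hypothetical, `iris` stands in for a phenotype data frame, and the 95% threshold is assumed to mean cumulative proportion of variance.

```r
# Illustrative sketch in base R; NOT the pheble implementation.
set.seed(123)
df <- iris                       # stand-in data: numeric columns plus a class column
class_col <- "Species"

# Hypothetical helper: sample a fraction of the indices within each class,
# keeping at least one observation per class.
strat_sample <- function(idx, classes, pct) {
  groups <- split(idx, droplevels(factor(classes)))
  unlist(lapply(groups, function(i) {
    i[sample.int(length(i), max(1, round(pct * length(i))))]
  }), use.names = FALSE)
}

all_idx   <- seq_len(nrow(df))
test_idx  <- strat_sample(all_idx, df[[class_col]], 0.15)            # share of all rows
rest_idx  <- setdiff(all_idx, test_idx)
vali_idx  <- strat_sample(rest_idx, df[[class_col]][rest_idx], 0.15) # share of the rest
train_idx <- setdiff(rest_idx, vali_idx)

# Fit the PCA on the training rows only.
num_cols <- setdiff(names(df), class_col)
pca <- prcomp(df[train_idx, num_cols], center = TRUE, scale. = FALSE)

# Keep the smallest number of PCs whose cumulative variance reaches 95%.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc <- which(cum_var >= 0.95)[1]

# Hypothetical helper: project a set of rows into the training PC space.
project <- function(idx) {
  scores <- predict(pca, newdata = df[idx, num_cols])[, seq_len(n_pc), drop = FALSE]
  data.frame(scores, class = df[[class_col]][idx])
}
train_df <- project(train_idx)
vali_df  <- project(vali_idx)
test_df  <- project(test_idx)
```

Fitting the rotation on the training rows alone keeps information from the validation and test sets out of the predictors, which is the reason the description distinguishes "performs" from "projects".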

Usage

ph_prep(
  df,
  ids_col,
  class_col,
  vali_pct = 0.15,
  test_pct = 0.15,
  scale = FALSE,
  center = NULL,
  sd = NULL,
  split_seed = 123,
  method = "pca",
  pca_pct = 0.95,
  max_mem_size = "15g",
  port = 54321,
  train_seed = 123,
  hyper_params = list(),
  search = "random",
  tune_length = 100
)

Arguments

`df`	A `data.frame` containing a column of unique ids, a column of classes, and an arbitrary number of `numeric` columns.
`ids_col`	A `character` value for the name of the ids column.
`class_col`	A `character` value for the name of the class column.
`vali_pct`	A `numeric` value for the percentage of training data to use as validation data: 0.15 (default).
`test_pct`	A `numeric` value for the percentage of total data to use as test data: 0.15 (default).
`scale`	A `logical` value for whether to scale the data: FALSE (default). Recommended if `method = "ae"` and if the user intends to train other models.
`center`	Either a `logical` value or numeric-alike vector of length equal to the number of columns of data to scale in `df`, where ‘numeric-alike’ means that as.numeric(.) will be applied successfully if is.numeric(.) is not true: NULL (default). If `scale = TRUE`, this is set to `TRUE` and is used to subtract the mean.
`sd`	Either a `logical` value or a numeric-alike vector of length equal to the number of columns of data to scale in `df`: NULL (default). If `scale = TRUE`, this is set to `TRUE` and is used to divide by the standard deviation.
`split_seed`	A `numeric` value to set the seed and control the randomness of splitting the data: 123 (default).
`method`	A `character` value for the dimensionality reduction method: "pca" (default), "ae", "none".
`pca_pct`	If `method = "pca"`, a `numeric` value for the proportion of variance used to select the principal components to retain: 0.95 (default).
`max_mem_size`	If `method = "ae"`, a `character` value for the memory of an h2o session: "15g" (default).
`port`	A `numeric` value for the port number of the H2O server: 54321 (default).
`train_seed`	A `numeric` value to set the seed and control the randomness of creating resamples: 123 (default).
`hyper_params`	A `list` of hyperparameters to perform a grid search: list() (default). The "default" list is: list(missing_values_handling = "Skip", activation = c("Rectifier", "Tanh"), hidden = list(5, 25, 50, 100, 250, 500, nrow(df_h2o)), input_dropout_ratio = c(0, 0.1, 0.2, 0.3), rate = c(0, 0.01, 0.005, 0.001)).
`search`	If `method = "ae"`, a `character` value for the hyperparameter search strategy: "random" (default), "grid".
`tune_length`	If `method = "ae"`, a `numeric` value (integer) for either the maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid"): 100 (default).
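
As a rough illustration of how `hyper_params` might be assembled for `method = "ae"`, the sketch below builds a smaller search space using the same element names as the "default" list and counts how many combinations a full grid would contain. The candidate values are arbitrary examples, and the commented call uses hypothetical names (`my_df`, `"id"`, `"class"`) for the data frame and its columns.

```r
# Hypothetical custom search space; element names mirror the "default" list.
hyper_params <- list(
  missing_values_handling = "Skip",
  activation = c("Rectifier", "Tanh"),
  hidden = list(5, 25, 50),
  input_dropout_ratio = c(0, 0.1, 0.2),
  rate = c(0.01, 0.005, 0.001)
)

# Size of the full grid: the product of the number of options per element.
n_combos <- prod(lengths(hyper_params))
n_combos  # 1 * 2 * 3 * 3 * 3 = 54

# Passing the list on (requires the pheble and h2o packages; not run here):
# ae_dfs <- ph_prep(df = my_df, ids_col = "id", class_col = "class",
#                   method = "ae", hyper_params = hyper_params,
#                   search = "random", tune_length = 20)
```

With `search = "random"`, `tune_length` is documented as the maximum number of combinations, so a value below the grid size (20 against 54 here) would sample only part of the space.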

Value

A list containing the following components:

`train_df`	The training set data frame.

`vali_df`	The validation set data frame.

`test_df`	The test set data frame.

`train_split`	The training set indices from the original data frame.

`vali_split`	The validation set indices from the original data frame.

`test_split`	The test set indices from the original data frame.

`vali_pct`	The percentage of training data used as validation data.

`test_pct`	The percentage of total data used as test data.

`method`	The dimensionality reduction method.
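
Because `test_pct` is taken from the total data while `vali_pct` is taken from the training data, the two default values of 0.15 do not produce equally sized sets. The arithmetic below is a sketch that assumes "training data" means the rows left after the test set is removed; the exact counts returned by ph_prep may differ because of per-class stratification and rounding.

```r
# Worked arithmetic for the default percentages (illustrative only).
n_total  <- 2000
test_pct <- 0.15
vali_pct <- 0.15

n_test  <- round(test_pct * n_total)   # 300 rows, 15% of the total
n_rest  <- n_total - n_test            # 1700 rows left for training/validation
n_vali  <- round(vali_pct * n_rest)    # 255 rows, 15% of the remaining rows
n_train <- n_rest - n_vali             # 1445 rows

# Shares of the total data: train 0.7225, validation 0.1275, test 0.15.
c(train = n_train, vali = n_vali, test = n_test) / n_total
```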

Examples

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Alternatively, preprocess data frame into train, validation, and test
## sets with latent variables as predictors. Notice that port is defined,
## because running H2O sessions one after another can cause connection
## errors.
ae_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample", class_col = "Species",
                  vali_pct = 0.15, test_pct = 0.15, method = "ae", port = 50001)

[Package pheble version 0.1.0 Index]