ph_prep {pheble}    R Documentation

Preprocessing for phenotype classification via ensemble learning.

Description

The ph_prep function splits a data frame into training, validation, and test sets while ensuring that every class is represented in each set. By default, it performs a Principal Component Analysis (PCA) on the training set and projects the validation and test data into that space. If a non-linear dimensionality reduction strategy is preferred, an autoencoder (method = "ae") can be used to extract deep features. Note that max_mem_size, hyper_params, search, and tune_length are only relevant when method = "ae". In that case, a list of candidate values (for example activation, hidden, input_dropout_ratio, and rate) can be supplied through hyper_params (see the argument details) to search for the best hyperparameter combination. The autoencoder with the lowest reconstruction error is selected as the best model.
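
The default workflow described above (a class-stratified split, a PCA fitted on the training rows only, and a projection of the held-out rows into that space) can be sketched in base R. This is an illustration under stated assumptions, not the pheble implementation: it uses the built-in iris data as a stand-in, builds only one held-out set (a validation set would be handled the same way), and the variable names are hypothetical.

```r
# Illustrative sketch only: base R, not the code used inside ph_prep.
set.seed(123)
df <- iris                      # stand-in data; "Species" plays the class column
class_col <- "Species"
num_cols <- setdiff(names(df), class_col)

# Stratified hold-out: sample 15% of the row indices within each class,
# so every class appears in both the training and the held-out set.
idx_by_class <- split(seq_len(nrow(df)), df[[class_col]])
test_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = max(1, round(0.15 * length(i))))
}), use.names = FALSE)
train_idx <- setdiff(seq_len(nrow(df)), test_idx)

# Fit the PCA on the training rows only.
pca <- prcomp(df[train_idx, num_cols], center = TRUE, scale. = TRUE)

# Training scores come from the fit; held-out rows are projected with predict().
train_pcs <- pca$x
test_pcs <- predict(pca, newdata = df[test_idx, num_cols])
```

Fitting on the training rows alone keeps information from the held-out rows out of the rotation, which is the reason for projecting rather than refitting.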

Usage

ph_prep(
  df,
  ids_col,
  class_col,
  vali_pct = 0.15,
  test_pct = 0.15,
  scale = FALSE,
  center = NULL,
  sd = NULL,
  split_seed = 123,
  method = "pca",
  pca_pct = 0.95,
  max_mem_size = "15g",
  port = 54321,
  train_seed = 123,
  hyper_params = list(),
  search = "random",
  tune_length = 100
)

Arguments

df

A data.frame containing a column of unique ids, a column of classes, and an arbitrary number of numeric columns.

ids_col

A character value for the name of the ids column.

class_col

A character value for the name of the class column.

vali_pct

A numeric value for the percentage of training data to use as validation data: 0.15 (default).

test_pct

A numeric value for the percentage of total data to use as test data: 0.15 (default).
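
Because test_pct refers to the total data while vali_pct refers to the training data, the two defaults do not remove equal numbers of rows. The arithmetic below is a rough sketch that assumes the test rows are removed first and that counts are simply rounded; the row count n_total is hypothetical, and the package's exact rounding and per-class handling may differ.

```r
# Hypothetical split-size arithmetic; not taken from the pheble source.
n_total  <- 2000   # assumed number of rows in df
test_pct <- 0.15
vali_pct <- 0.15

n_test  <- round(n_total * test_pct)   # share of the total data
n_rest  <- n_total - n_test            # rows left after removing the test set
n_vali  <- round(n_rest * vali_pct)    # share of the remaining (training) data
n_train <- n_rest - n_vali

# Validation rows as a fraction of the total data (smaller than vali_pct).
vali_share_of_total <- n_vali / n_total
```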

scale

A logical value for whether to scale the data: FALSE (default). Recommended if method = "ae" and if the user intends to train other models.

center

Either a logical value or numeric-alike vector of length equal to the number of columns of data to scale in df, where ‘numeric-alike’ means that as.numeric(.) will be applied successfully if is.numeric(.) is not true: NULL (default). If scale = TRUE, this is set to TRUE and is used to subtract the mean.

sd

Either a logical value or a numeric-alike vector of length equal to the number of columns of data to scale in df: NULL (default). If scale = TRUE, this is set to TRUE and is used to divide by the standard deviation.
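
As a sketch of how numeric center and sd vectors can be used, the base R function scale() accepts either logical values or one value per column. The example below computes the statistics on one subset and reuses them on another. Whether ph_prep reuses training statistics in exactly this way is an assumption, and the subsets of iris are arbitrary stand-ins.

```r
# Illustrative sketch with base R scale(); not the pheble implementation.
train <- as.matrix(iris[1:100, 1:4])
test  <- as.matrix(iris[101:150, 1:4])

# Logical values: compute column means and standard deviations from the data.
train_scaled <- scale(train, center = TRUE, scale = TRUE)

# The statistics that were used are stored as attributes of the result.
ctr <- attr(train_scaled, "scaled:center")
sdv <- attr(train_scaled, "scaled:scale")

# Numeric vectors (one value per column): apply the same shift and scale
# to new data instead of recomputing them.
test_scaled <- scale(test, center = ctr, scale = sdv)
```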

split_seed

A numeric value to set the seed and control the randomness of splitting the data: 123 (default).

method

A character value for the dimensionality reduction method: "pca" (default), "ae", "none".

pca_pct

If method = "pca", a numeric value for the proportion of variance to subset the PCA with: 0.95 (default).
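
One common way to subset a PCA by proportion of variance is to keep the smallest number of leading components whose cumulative explained variance reaches the threshold. The sketch below assumes that rule; the exact rule used by ph_prep is not stated here, and iris is a stand-in dataset.

```r
# Illustrative sketch: choose components by cumulative variance with prcomp().
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and its running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Keep the first component at which the running total reaches pca_pct.
pca_pct <- 0.95
n_keep <- which(cum_var >= pca_pct)[1]
pcs <- pca$x[, seq_len(n_keep), drop = FALSE]
```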

max_mem_size

If method = "ae", a character value for the memory of an H2O session: "15g" (default).

port

A numeric value for the port number of the H2O server: 54321 (default).

train_seed

A numeric value to set the seed and control the randomness of creating resamples: 123 (default).

hyper_params

A list of hyperparameters to perform a grid search. The "default" list is: list(missing_values_handling = "Skip", activation = c("Rectifier", "Tanh"), hidden = list(5, 25, 50, 100, 250, 500, nrow(df_h2o)), input_dropout_ratio = c(0, 0.1, 0.2, 0.3), rate = c(0, 0.01, 0.005, 0.001)).

search

If method = "ae", a character value for the hyperparameter search strategy: "random" (default), "grid".

tune_length

If method = "ae", a numeric value (integer) for either the maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid"): 100 (default).
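
To give a feel for how hyper_params, search = "random", and tune_length interact, the sketch below enumerates every combination of a shortened, hypothetical hyperparameter list and then draws at most tune_length of them at random. It uses only base R and does not start H2O; the search inside ph_prep is handled by H2O and may differ in detail. The "grid" interpretation of tune_length is not covered.

```r
# Illustrative sketch of a random hyperparameter search space in base R.
# The list is a shortened, hypothetical variant of the documented defaults.
hyper_params <- list(
  activation          = c("Rectifier", "Tanh"),
  hidden              = c(5, 25, 50, 100),
  input_dropout_ratio = c(0, 0.1, 0.2, 0.3),
  rate                = c(0, 0.01, 0.005, 0.001)
)

# Full Cartesian product: one row per hyperparameter combination.
grid <- expand.grid(hyper_params, stringsAsFactors = FALSE)

# Random search: evaluate at most tune_length of the combinations.
set.seed(123)
tune_length <- 100
n_try <- min(tune_length, nrow(grid))
random_grid <- grid[sample(nrow(grid), n_try), ]
```

With these values the full product has 2 x 4 x 4 x 4 = 128 rows, so a tune_length of 100 would leave some combinations untried.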

Value

A list containing the following components:

train_df       The training set data frame.
vali_df        The validation set data frame.
test_df        The test set data frame.
train_split    The training set indices from the original data frame.
vali_split     The validation set indices from the original data frame.
test_split     The test set indices from the original data frame.
vali_pct       The percentage of training data used as validation data.
test_pct       The percentage of total data used as test data.
method         The dimensionality reduction method.

Examples

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Alternatively, preprocess data frame into train, validation, and test
## sets with latent variables as predictors. Notice that port is defined,
## because running H2O sessions one after another can cause connection
## errors.
ae_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample", class_col = "Species",
                  vali_pct = 0.15, test_pct = 0.15, method = "ae", port = 50001)


[Package pheble version 0.1.0 Index]