ph_ensemble {pheble}    R Documentation

Classify phenotypes via ensemble learning.

Description

The ph_ensemble function trains an ensemble model on the classification predictions of a list of algorithms. The list can hold models trained manually with train or, more conveniently, the output of ph_train. Hyperparameter tuning and model evaluation are handled internally to simplify the ensembling process. The function assumes the data have already been preprocessed and split, which is why it requires separate training, validation, and test sets.
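The general idea of training a metalearner on base-model predictions (stacking) can be sketched in base R. This is a minimal illustration on simulated binary data with glm as both base learner and metalearner; the data, variable names, and split sizes are assumptions made for the sketch, and the internals of ph_ensemble may differ.

```r
set.seed(123)

# Simulated binary data (hypothetical; not the ph_crocs data).
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x1 - x2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Split into training, validation, and test sets.
idx <- sample(rep(c("train", "vali", "test"), times = c(200, 50, 50)))
train <- dat[idx == "train", ]
vali <- dat[idx == "vali", ]
test <- dat[idx == "test", ]

# Two base models fitted on the training set only.
m1 <- glm(y ~ x1, data = train, family = binomial)
m2 <- glm(y ~ x2, data = train, family = binomial)

# Collect base-model probabilities for any data set.
base_preds <- function(newdata) {
  data.frame(m1 = predict(m1, newdata, type = "response"),
             m2 = predict(m2, newdata, type = "response"))
}

# The metalearner is trained on validation predictions, not raw features.
meta_df <- cbind(base_preds(vali), y = vali$y)
meta <- glm(y ~ m1 + m2, data = meta_df, family = binomial)

# Ensemble predictions for the held-out test set.
ens_prob <- predict(meta, base_preds(test), type = "response")
ens_class <- as.integer(ens_prob > 0.5)
```

Keeping the validation set separate from the training set means the metalearner sees predictions the base models did not fit directly, and the test set remains untouched for the final evaluation.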

Usage

ph_ensemble(
  train_models,
  train_df,
  vali_df,
  test_df,
  class_col,
  ctrl,
  train_seed = 123,
  n_cores = 2,
  task = "multi",
  metric = ifelse(task == "multi", "Kappa", "ROC"),
  top_models = 3,
  metalearner = ifelse(task == "multi", "glmnet", "rf"),
  tune_length = 10,
  quiet = FALSE
)

Arguments

train_models

A list of at least two train models.

train_df

A data.frame containing a class column and the training data.

vali_df

A data.frame containing a class column and the validation data.

test_df

A data.frame containing a class column and the test data.

class_col

A character value for the name of the class column. This should be consistent across data frames.

ctrl

A list containing the resampling strategy (e.g., "boot") and other parameters for trainControl. Automatically create one via ph_ctrl or manually create it with trainControl.
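A manually created control object might look like the sketch below. It assumes the caret package is installed and uses settings chosen only for illustration (25 bootstrap resamples, random search, class probabilities, multiClassSummary); the values that ph_ctrl actually sets may differ.

```r
# Assumed settings for illustration; ph_ctrl may choose different values.
if (requireNamespace("caret", quietly = TRUE)) {
  manual_ctrl <- caret::trainControl(
    method = "boot",                          # resampling strategy
    number = 25,                              # number of resamples
    search = "random",                        # random hyperparameter search
    classProbs = TRUE,                        # needed for probability-based metrics
    savePredictions = "final",                # keep predictions of the final model
    summaryFunction = caret::multiClassSummary
  )
} else {
  manual_ctrl <- NULL
  message("caret is not installed; skipping the trainControl sketch.")
}
```

The summary function determines which metric names are available, so it should match the metric argument (for example, multiClassSummary reports "Kappa" and "Mean_F1").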

train_seed

A numeric value to set the training seed and control the randomness of creating resamples: 123 (default).

n_cores

An integer value for the number of cores to include in the cluster: 2 (default). We highly recommend increasing this value to, e.g., parallel::detectCores() - 1.
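One way to pick this value is sketched below. Because parallel::detectCores() can return NA on some platforms, the sketch falls back to a single core in that case; the fallback is an assumption of this example, not package behavior.

```r
# Leave one core free for the operating system; never go below 1.
available <- parallel::detectCores()
n_cores <- if (is.na(available)) 1L else max(1L, available - 1L)
n_cores
```

The result can then be passed as n_cores = n_cores.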

task

A character value for the type of classification task: "multi" (default), "binary".

metric

A character value for which summary metric should be used to select the optimal model: "ROC" (default for "binary") and "Kappa" (default for "multi"). Other options include "logLoss", "Accuracy", "Mean_Balanced_Accuracy", and "Mean_F1".

top_models

A numeric value for the top n training models to ensemble: 3 (default). The training models are ranked by their final metric values (e.g., "ROC" or "Kappa") and the top n are selected.
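The selection step can be pictured with a small base R sketch. The metric values below are made up, and the sketch assumes a metric where larger is better (such as "Kappa"); it is not the package's internal code.

```r
# Hypothetical final Kappa values for five trained models (made-up numbers).
model_metrics <- c(lda = 0.81, mda = 0.84, nnet = 0.79,
                   pda = 0.80, sparseLDA = 0.77)
top_models <- 3

# Rank from best to worst and keep the names of the top n models.
ranked <- sort(model_metrics, decreasing = TRUE)
top_names <- names(head(ranked, top_models))
top_names
```

For a metric that is minimized, such as "logLoss", the ordering in this sketch would need to be reversed.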

metalearner

A character value for the algorithm used to train the ensemble: "glmnet" (default for "multi"), "rf" (default for "binary"). Other methods, such as those listed in ph_train methods, may also be used.

tune_length

If the ctrl object uses search = "random" (default), this is an integer value for the maximum number of hyperparameter combinations to test for each training model in the ensemble; if it uses search = "grid", this is an integer value for the number of levels of each hyperparameter to test for each model.
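The difference between the two search modes can be illustrated in base R. The sketch assumes a model with two hyperparameters named alpha and lambda, with value ranges chosen only for illustration; it does not reproduce how the candidate values are actually generated.

```r
tune_length <- 3

# Grid search: tune_length levels per hyperparameter, fully crossed.
grid <- expand.grid(
  alpha = seq(0.1, 1, length.out = tune_length),
  lambda = 10^seq(-3, -1, length.out = tune_length)
)
nrow(grid)    # 3 levels x 3 levels = 9 combinations

# Random search: at most tune_length combinations in total.
set.seed(123)
random <- data.frame(
  alpha = runif(tune_length),
  lambda = 10^runif(tune_length, min = -3, max = -1)
)
nrow(random)  # 3 combinations
```

With grid search the number of combinations grows with the number of hyperparameters, so the same tune_length can be far more expensive than with random search.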

quiet

A logical value for whether progress output should be silenced: FALSE (default), TRUE.

Value

A list containing the following components:

ensemble_test_preds  The ensemble predictions for the test set.
vali_preds           The validation predictions for the top models.
test_preds           The test predictions for the top models.
all_test_preds       The test predictions for every successfully trained model.
all_test_results     The confusion matrix results obtained from comparing the model test predictions (i.e., original models and ensemble) against the actual test classes.
ensemble_model       The ensemble train object.
var_imps             The ensemble variable importances obtained via weighted averaging: each model's train importances are multiplied by that model's importance in the ensemble, then averaged across models and normalized.
train_df             The training data frame.
vali_df              The validation data frame.
test_df              The test data frame.
train_models         The train models for the ensemble.
ctrl                 A trainControl object.
metric               The summary metric used to select the optimal model.
task                 The type of classification task.
tune_length          The maximum number of hyperparameter combinations (search = "random") or the number of levels per hyperparameter (search = "grid").
top_models           The number of top models selected for the ensemble.
metalearner          The algorithm used to train the ensemble.
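The weighted averaging described for var_imps can be sketched in base R as follows. All numbers are hypothetical, and the final scaling (largest importance set to 100) is an assumption of this sketch; the package may normalize differently.

```r
# Hypothetical per-model importances: rows = predictors, columns = models.
model_imps <- matrix(
  c(80, 20,  0,   # mda
    60, 30, 10,   # lda
    50, 25, 25),  # pda
  nrow = 3,
  dimnames = list(c("PC1", "PC2", "PC3"), c("mda", "lda", "pda"))
)

# Hypothetical importance of each model within the ensemble.
ensemble_weights <- c(mda = 0.5, lda = 0.3, pda = 0.2)

# Multiply each model's column by that model's weight in the ensemble.
weighted <- sweep(model_imps, 2, ensemble_weights[colnames(model_imps)], `*`)

# Average across models, then rescale so the largest importance is 100.
avg <- rowMeans(weighted)
var_imps <- 100 * avg / max(avg)
round(var_imps, 2)
```

A predictor that matters only to a low-weight model therefore ends up with a small ensemble importance, even if that model ranked it highly.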

Examples

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Create control object for train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda", "nnet",
                                     "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)
## Train the ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
ensemble_model <- ph_ensemble(train_models = train_models$train_models,
                              train_df = pc_dfs$train_df,
                              vali_df = pc_dfs$vali_df,
                              test_df = pc_dfs$test_df,
                              class_col = "Species",
                              ctrl = ctrl,
                              task = "multi",
                              top_models = 3,
                              metalearner = "glmnet",
                              tune_length = 25,
                              quiet = FALSE)


[Package pheble version 0.1.0 Index]