ph_ensemble {pheble}

R Documentation

Classify phenotypes via ensemble learning.

Description

The ph_ensemble function trains an ensemble model from the classification predictions of a list of algorithms. The list can contain models trained manually with train or, more conveniently, the output of ph_train. Hyperparameter tuning and model evaluation are handled internally to simplify the ensembling process. The function assumes the data have already been preprocessed and split, which is why it requires training, validation, and test sets.
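As a rough illustration of the idea described above, the following base R sketch stacks two simple classifiers: their predictions on a validation set become the predictors of a metalearner, which is then scored on a test set. It is a sketch of the general stacking pattern, not pheble's internal implementation. The simulated data, the object names (`base_1`, `base_2`, `base_preds`, `meta`), the use of `glm` in place of `train` models, and the choice to fit the metalearner on validation-set predictions are all assumptions made for the example.

```r
## Simulated two-class data (hypothetical; stands in for real phenotype data).
set.seed(123)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x1 - x2))
df <- data.frame(class = factor(ifelse(y == 1, "B", "A")), x1 = x1, x2 = x2)

## Split into training, validation, and test sets (70/15/15).
part <- sample(rep(c("train", "vali", "test"), times = c(210, 45, 45)))
train_df <- df[part == "train", ]
vali_df <- df[part == "vali", ]
test_df <- df[part == "test", ]

## Base learners: each one only sees a single predictor.
base_1 <- glm(class ~ x1, data = train_df, family = binomial)
base_2 <- glm(class ~ x2, data = train_df, family = binomial)

## Base-model predicted probabilities of class "B" for a data frame.
base_preds <- function(newdata) {
  data.frame(p_1 = predict(base_1, newdata = newdata, type = "response"),
             p_2 = predict(base_2, newdata = newdata, type = "response"))
}

## Metalearner: trained on the base predictions for the validation set.
vali_meta <- base_preds(vali_df)
vali_meta$class <- vali_df$class
meta <- glm(class ~ p_1 + p_2, data = vali_meta, family = binomial)

## Ensemble predictions and accuracy on the held-out test set.
test_prob <- predict(meta, newdata = base_preds(test_df), type = "response")
ensemble_pred <- factor(ifelse(test_prob > 0.5, "B", "A"),
                        levels = levels(df$class))
accuracy <- mean(ensemble_pred == test_df$class)
```

Keeping the three sets separate means the metalearner never sees the rows used to fit the base learners, and the test set is only used for the final evaluation.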

Usage

ph_ensemble(
  train_models,
  train_df,
  vali_df,
  test_df,
  class_col,
  ctrl,
  train_seed = 123,
  n_cores = 2,
  task = "multi",
  metric = ifelse(task == "multi", "Kappa", "ROC"),
  top_models = 3,
  metalearner = ifelse(task == "multi", "glmnet", "rf"),
  tune_length = 10,
  quiet = FALSE
)

Arguments

`train_models`	A `list` of at least two `train` models.
`train_df`	A `data.frame` containing a class column and the training data.
`vali_df`	A `data.frame` containing a class column and the validation data.
`test_df`	A `data.frame` containing a class column and the test data.
`class_col`	A `character` value for the name of the class column. This should be consistent across data frames.
`ctrl`	A `list` containing the resampling strategy (e.g., "boot") and other parameters for `trainControl`. Automatically create one via `ph_ctrl` or manually create it with `trainControl`.
`train_seed`	A `numeric` value to set the training seed and control the randomness of creating resamples: 123 (default).
`n_cores`	An `integer` value for the number of cores to include in the cluster: 2 (default). We highly recommend increasing this value, e.g., to `parallel::detectCores() - 1`.
`task`	A `character` value for the type of classification task: "multi" (default), "binary".
`metric`	A `character` value for which summary metric should be used to select the optimal model: "ROC" (default for "binary") and "Kappa" (default for "multi"). Other options include "logLoss", "Accuracy", "Mean_Balanced_Accuracy", and "Mean_F1".
`top_models`	A `numeric` value for the top n training models to ensemble: 3 (default). The training models are ranked by their final metric value (e.g., "ROC" or "Kappa") and the top n models are selected.
`metalearner`	A `character` value for the algorithm used to train the ensemble: "glmnet" (default for "multi"), "rf" (default for "binary"). Other methods, such as those listed in ph_train methods, may also be used.
`tune_length`	If `search = "random"` (default), this is an `integer` value for the maximum number of hyperparameter combinations to test for each training model in the ensemble; if `search = "grid"`, this is an `integer` value for the number of levels of each hyperparameter to test for each model.
`quiet`	A `logical` value for whether progress output should be suppressed: FALSE (default, progress is printed), TRUE.
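The task-dependent defaults and the top-n selection described in the arguments can be sketched in base R as follows. The `ifelse` expressions are copied from the Usage block; the helper `resolve_defaults`, the metric values in `final_metric`, and the assumption that a larger metric value is better are hypothetical and only serve the illustration.

```r
## Hypothetical helper mirroring the defaults shown in Usage:
## both `metric` and `metalearner` depend on `task`.
resolve_defaults <- function(task = "multi") {
  task <- match.arg(task, c("multi", "binary"))
  list(metric = ifelse(task == "multi", "Kappa", "ROC"),
       metalearner = ifelse(task == "multi", "glmnet", "rf"))
}
binary_defaults <- resolve_defaults("binary")
multi_defaults <- resolve_defaults("multi")

## Leave one core free for the rest of the system; fall back to 1 if
## the number of cores cannot be detected (detectCores() may return NA).
n_cores <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)

## Rank models by a final metric and keep the top n.
## The Kappa values below are made up; larger is assumed to be better.
final_metric <- c(lda = 0.81, mda = 0.86, nnet = 0.79,
                  pda = 0.84, sparseLDA = 0.88)
top_models <- 3
top <- names(sort(final_metric, decreasing = TRUE))[seq_len(top_models)]
```

The resulting `n_cores` value could then be passed to ph_train or ph_ensemble in place of the default of 2.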

Value

A list containing the following components:

`ensemble_test_preds`	The ensemble predictions for the test set.

`vali_preds`	The validation predictions for the top models.

`test_preds`	The test predictions for the top models.

`all_test_preds`	The test predictions for every successfully trained model.

`all_test_results`	The confusion matrix results obtained from comparing the model test predictions (i.e., original models and ensemble) against the actual test classes.

`ensemble_model`	The ensemble `train` object.

`var_imps`	The ensemble variable importances obtained via weighted averaging. The original train importances are multiplied by the model's importance in the ensemble, then averaged across models and normalized.

`train_df`	The training data frame.

`vali_df`	The validation data frame.

`test_df`	The test data frame.

`train_models`	The `train` models for the ensemble.

`ctrl`	A `trainControl` object.

`metric`	The summary metric used to select the optimal model.

`task`	The type of classification task.

`tune_length`	The maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid").

`top_models`	The number of top methods selected for the ensemble.

`metalearner`	The algorithm used to train the ensemble.
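The weighted averaging described for `var_imps` can be illustrated with a small base R sketch. All numbers are invented, the object names (`base_imp`, `model_weight`) are hypothetical, and "normalized" is assumed here to mean rescaling so that the importances sum to 100; the package may use a different scaling.

```r
## Hypothetical per-model variable importances:
## rows are predictors, columns are base models.
base_imp <- cbind(lda = c(PC1 = 100, PC2 = 40, PC3 = 10),
                  mda = c(PC1 = 80, PC2 = 60, PC3 = 20),
                  pda = c(PC1 = 50, PC2 = 100, PC3 = 30))

## Hypothetical importance of each base model in the ensemble.
model_weight <- c(lda = 0.5, mda = 0.3, pda = 0.2)

## Multiply each model's importances by that model's weight.
weighted <- sweep(base_imp, 2, model_weight[colnames(base_imp)], `*`)

## Average across models, then rescale to sum to 100 (assumed normalization).
avg_imp <- rowMeans(weighted)
var_imps <- 100 * avg_imp / sum(avg_imp)
```

With this scheme, a predictor ranks highly only if it matters to the base models that the ensemble itself relies on.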

Examples

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Create control object for the train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda",
                                     "nnet", "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)
## Train the ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
ensemble_model <- ph_ensemble(train_models = train_models$train_models,
                              train_df = pc_dfs$train_df,
                              vali_df = pc_dfs$vali_df,
                              test_df = pc_dfs$test_df,
                              class_col = "Species",
                              ctrl = ctrl,
                              task = "multi",
                              top_models = 3,
                              metalearner = "glmnet",
                              tune_length = 25,
                              quiet = FALSE)

[Package pheble version 0.1.0 Index]