ph_train {pheble}

R Documentation

Generate predictions for phenotype ensemble.

Description

The ph_train function automatically trains a set of binary or multi-class classification models to ultimately build a new dataset of predictions. The data preprocessing and hyperparameter tuning are handled internally to minimize user input and simplify the training.

Usage

ph_train(
  train_df,
  vali_df,
  test_df,
  class_col,
  ctrl,
  train_seed = 123,
  n_cores = 2,
  task = "multi",
  methods = "all",
  metric = ifelse(task == "multi", "Kappa", "ROC"),
  tune_length = 10,
  quiet = FALSE
)
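
The `metric` default in the signature above depends on `task`. The following is a minimal sketch of how that default resolves, using a hypothetical stand-in function, `resolve_metric`, that is not part of pheble and copies only the two default expressions from the usage:

```r
# Hypothetical stand-in (not part of pheble): mirrors only the task/metric
# defaults shown in the ph_train() usage.
resolve_metric <- function(task = "multi",
                           metric = ifelse(task == "multi", "Kappa", "ROC")) {
  # R evaluates default arguments lazily, so the default for `metric`
  # sees whichever `task` value the caller supplied.
  metric
}

resolve_metric()                                       # "Kappa"
resolve_metric(task = "binary")                        # "ROC"
resolve_metric(task = "binary", metric = "Accuracy")   # "Accuracy"
```

An explicitly supplied `metric` always takes precedence over the task-dependent default.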

Arguments

`train_df`	A `data.frame` containing a class column and the training data.
`vali_df`	A `data.frame` containing a class column and the validation data.
`test_df`	A `data.frame` containing a class column and the test data.
`class_col`	A `character` value for the name of the class column shared across the train, validation, and test sets.
`ctrl`	A `list` containing the resampling strategy (e.g., "boot") and other parameters for `trainControl`. Automatically create one via `ph_ctrl` or manually create it with `trainControl`.
`train_seed`	A `numeric` value to set the training seed and control the randomness of creating resamples: 123 (default).
`n_cores`	An `integer` value for the number of cores to include in the cluster: 2 (default). We highly recommend increasing this value, e.g., to `parallel::detectCores() - 1`.
`task`	A `character` value for the type of classification `task`: "multi" (default), "binary".
`methods`	A `character` value enumerating the names (at least two, unless "all") of the classification methods to ensemble: "all" (default). If `task = "binary"`, there are 33 methods to choose from: "AdaBag", "AdaBoost.M1", "C5.0", "evtree", "glmnet", "hda", "kernelpls", "kknn", "lda", "loclda", "mda", "nb", "nnet", "pda", "pls", "qda", "rda", "rf", "sparseLDA", "stepLDA", "stepQDA", "treebag", "svmLinear", "svmPoly", "svmRadial", "gaussprLinear" (slow), "gaussprPoly" (slow), "gaussprRadial" (slow), "bagEarthGCV", "cforest", "earth", "fda", "hdda". If `task = "multi"`, there are 30 methods to choose from: "AdaBag", "AdaBoost.M1", "C5.0", "evtree", "glmnet", "hda", "kernelpls", "kknn", "lda", "loclda", "mda", "nb", "nnet", "pda", "pls", "qda", "rda", "rf", "sparseLDA", "stepLDA", "stepQDA", "treebag", "svmLinear", "svmPoly", "svmRadial", "bagEarthGCV", "cforest", "earth", "fda", "hdda".
`metric`	A `character` value for which summary metric should be used to select the optimal model: "ROC" (default for "binary") and "Kappa" (default for "multi"). Other options include "logLoss", "Accuracy", "Mean_Balanced_Accuracy", and "Mean_F1".
`tune_length`	If `search = "random"` (default) in `ctrl`, this is an `integer` value for the maximum number of hyperparameter combinations to test for each training model in the ensemble; if `search = "grid"`, this is an `integer` value for the number of levels of each hyperparameter to test for each model.
`quiet`	A `logical` value for whether progress messages should be suppressed: FALSE (default), TRUE.
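
Two of the arguments above lend themselves to a quick pre-flight check before calling `ph_train`: `n_cores` and `methods`. The following is a minimal sketch, not pheble's own validation. The helper `check_methods` is hypothetical, and the method vectors are copied from the lists in the `methods` entry.

```r
# Use all but one detected core. detectCores() can return NA on some
# platforms, so fall back to a single core in that case.
available <- parallel::detectCores()
n_cores <- if (is.na(available)) 1L else max(1L, available - 1L)

# The 30 methods documented for task = "multi".
multi_methods <- c(
  "AdaBag", "AdaBoost.M1", "C5.0", "evtree", "glmnet", "hda", "kernelpls",
  "kknn", "lda", "loclda", "mda", "nb", "nnet", "pda", "pls", "qda", "rda",
  "rf", "sparseLDA", "stepLDA", "stepQDA", "treebag", "svmLinear", "svmPoly",
  "svmRadial", "bagEarthGCV", "cforest", "earth", "fda", "hdda"
)
# task = "binary" adds the three (slow) Gaussian process methods, giving 33.
binary_methods <- c(multi_methods,
                    "gaussprLinear", "gaussprPoly", "gaussprRadial")

# Hypothetical helper (not part of pheble): enforces the documented rule of
# "at least two methods, unless 'all'" and rejects names outside the list.
check_methods <- function(methods, task = "multi") {
  if (identical(methods, "all")) {
    return(invisible(TRUE))
  }
  allowed <- if (task == "binary") binary_methods else multi_methods
  unknown <- setdiff(methods, allowed)
  if (length(unknown) > 0) {
    stop("Unsupported methods for task '", task, "': ",
         paste(unknown, collapse = ", "))
  }
  if (length(unique(methods)) < 2) {
    stop("Supply at least two methods, or \"all\".")
  }
  invisible(TRUE)
}

check_methods(c("lda", "mda", "nnet"))   # passes silently
```

The resulting `n_cores` value can then be passed straight to `ph_train(..., n_cores = n_cores)`.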

Value

A list containing the following components:

`train_models`	The `train` models for the ensemble.

`train_df`	The training data frame.

`vali_df`	The validation data frame.

`test_df`	The test data frame.

`task`	The type of classification task.

`ctrl`	A list of resampling parameters used in `trainControl`.

`methods`	The names of the classification methods to ensemble.

`search`	The hyperparameter search strategy.

`n_cores`	The number of cores for parallel processing.

`metric`	The summary metric used to select the optimal model.

`tune_length`	The maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid").
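
Because the return value is a named list, its components can be read with `$` or `[[`. The following is a minimal sketch that uses a placeholder list in place of a real `ph_train` result. The component names come from the list above; the values are illustrative only.

```r
# Placeholder standing in for a ph_train() result; values are illustrative
# only and do not come from a real training run.
res <- list(
  train_models = list(), train_df = data.frame(), vali_df = data.frame(),
  test_df = data.frame(), task = "multi", ctrl = list(),
  methods = c("lda", "mda"), search = "random", n_cores = 2,
  metric = "Kappa", tune_length = 5
)

# Component names documented in the Value section.
documented <- c("train_models", "train_df", "vali_df", "test_df", "task",
                "ctrl", "methods", "search", "n_cores", "metric",
                "tune_length")

# Confirm that every documented component is present before using the result.
missing_parts <- setdiff(documented, names(res))
if (length(missing_parts) > 0) {
  stop("Missing components: ", paste(missing_parts, collapse = ", "))
}

# Read individual settings by name.
res$metric
res[["tune_length"]]
```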

Examples

## Import data.
data(ph_crocs)

## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Create the control object for the train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few methods, although more are preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda", "nnet",
                                     "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)

[Package pheble version 0.1.0 Index]