R: Optimize parameters to be used in training the final RF model

optimize_RF {multiclassPairs}

R Documentation

Optimize parameters to be used in training the final RF model

Description

optimize_RF takes different sets of parameters to be used as training parameters. It passes each set to the train_RF function and returns the accuracies and related measurements (i.e. the number of genes and rules used) for each RF model trained on each set. Accuracies can be calculated on the training data or by applying the trained RF model to separate testing data.

Usage

optimize_RF(data_object,
            sorted_rules_RF,
            parameters,
            overall = c("Accuracy", "Kappa", "AccuracyLower",
                        "AccuracyUpper", "AccuracyNull", "AccuracyPValue",
                        "McnemarPValue")[1:2],
            byclass = c("Sensitivity", "Specificity",
                        "Pos Pred Value", "Neg Pred Value",
                        "Precision", "Recall", "F1", "Prevalence",
                        "Detection Rate", "Detection Prevalence",
                        "Balanced Accuracy" )[c(11)],
            seed = 123456,
            test_object = NULL,
            impute = TRUE,
            impute_reject = 0.67,
            verbose = FALSE)
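
The defaults for overall and byclass in the usage above are written as the full vector of choices followed by an index, so only the indexed elements are used unless you pass your own vector. A quick base R check of what those two expressions evaluate to (the variable names are only for illustration):

```r
# Full list of overall measurements, indexed exactly as in the usage
overall_default <- c("Accuracy", "Kappa", "AccuracyLower",
                     "AccuracyUpper", "AccuracyNull", "AccuracyPValue",
                     "McnemarPValue")[1:2]

# Full list of by-class measurements, indexed exactly as in the usage
byclass_default <- c("Sensitivity", "Specificity",
                     "Pos Pred Value", "Neg Pred Value",
                     "Precision", "Recall", "F1", "Prevalence",
                     "Detection Rate", "Detection Prevalence",
                     "Balanced Accuracy")[c(11)]

print(overall_default)  # the first two overall measurements
print(byclass_default)  # the eleventh by-class measurement
```

So leaving both arguments untouched reports Accuracy, Kappa and Balanced Accuracy in the summary table.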

Arguments

`data_object`	Data object with labels generated by ReadData function
`sorted_rules_RF`	sorted rules object generated by sort_rules_RF function
`parameters`	a dataframe of training parameters for the RF models. Column names should match the arguments of the train_RF function. Each row represents one trial (model); e.g. a dataframe with 10 rows means you want to check the performance of 10 different RF models based on 10 different sets of parameters.
`overall`	a vector with the names of the overall performance measurements to be reported in the summary table in the results. It can be one or more of these measurements: "Accuracy", "Kappa", "AccuracyLower", "AccuracyUpper", "AccuracyNull", "AccuracyPValue", "McnemarPValue". Default is c("Accuracy", "Kappa"). These measurements are based on the output of the confusionMatrix function in the caret package.
`byclass`	a vector with the names of the performance measurements for individual classes to be reported in the summary table in the results. It can be one or more of these measurements: "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy". Default is "Balanced Accuracy". These measurements are based on the output of the confusionMatrix function in the caret package.
`seed`	seed to be used in the training process for reproducibility.
`test_object`	data object with labels generated by ReadData to be used as testing data. If this object is provided then the accuracies and performance results will be based on this object not the training data.
`impute`	logical to be passed to predict_RF when test_object is used. It controls whether missed genes and NA values in test_object are imputed. Default is TRUE.
`impute_reject`	a number between 0 and 1 to be passed to predict_RF when test_object is used. It indicates the threshold for the fraction of missed rules in a sample. Based on this threshold, the sample will be rejected (i.e. skipped) and its missed rules will not be imputed. Default is 0.67.
`verbose`	a logical value indicating whether processing messages will be printed or not. Default is FALSE.
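
To make the impute_reject threshold concrete, here is a minimal base R sketch of the idea only. It is not the code used by predict_RF: the toy matrix rule_values, the variable names, and the strict ">" comparison at the boundary are all assumptions made for illustration.

```r
# Toy matrix (hypothetical): rows are rules, columns are samples,
# NA marks a rule that could not be evaluated in that sample
rule_values <- matrix(c(1,  0,  1,  1,    # S1: no missed rules
                        NA, NA, NA, 1,    # S2: 3 of 4 rules missed
                        NA, 0,  1,  1),   # S3: 1 of 4 rules missed
                      nrow = 4, ncol = 3,
                      dimnames = list(paste0("rule", 1:4),
                                      paste0("S", 1:3)))

impute_reject <- 0.67

# Fraction of missed rules per sample
missed_fraction <- colMeans(is.na(rule_values))

# Assumed decision: samples above the threshold are rejected, not imputed
rejected <- missed_fraction > impute_reject

print(missed_fraction)
print(rejected)
```

In this toy case only S2 exceeds the threshold, so under these assumptions it would be skipped while S1 and S3 would be kept (and S3 imputed when impute = TRUE).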

Details

optimize_RF helps the user choose the parameters to pass to the train_RF function for a given training dataset.
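
Conceptually, each row of parameters becomes one call to the training function, and a failing trial is recorded rather than stopping the whole run. The base R sketch below shows that pattern under stated assumptions: toy_train is a hypothetical stand-in for train_RF with a made-up accuracy formula, and the actual internals of optimize_RF may differ.

```r
# Hypothetical stand-in for train_RF: returns a fake accuracy and
# fails when run_boruta cannot be read as a logical value
toy_train <- function(gene_repetition, rules_altogether, run_boruta) {
  if (is.na(as.logical(run_boruta))) {
    stop("run_boruta must be TRUE or FALSE")
  }
  list(accuracy = 1 / (gene_repetition + 1) + rules_altogether / 100)
}

# One row per trial; the second row is invalid on purpose
parameters <- data.frame(
  gene_repetition  = c(3, 2, 1),
  rules_altogether = c(2, 3, 10),
  run_boruta       = c("FALSE", "produce_error", "FALSE"),
  stringsAsFactors = FALSE)

accuracy <- rep(NA_real_, nrow(parameters))
errors   <- vector("list", nrow(parameters))

for (i in seq_len(nrow(parameters))) {
  # Turn row i into a named argument list and call the trainer
  args <- as.list(parameters[i, , drop = FALSE])
  res  <- tryCatch(do.call(toy_train, args), error = function(e) e)
  if (inherits(res, "error")) {
    errors[[i]] <- conditionMessage(res)   # keep the message, move on
  } else {
    accuracy[i] <- res$accuracy
  }
}

# One row per trial: the inputs plus the measured performance
summary_table <- cbind(parameters, Accuracy = accuracy)
print(summary_table)
```

The failed trial keeps its row with a missing accuracy in this sketch, which makes it easy to match an error message back to the parameter set that caused it.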

Value

Returns an optimize_RF_output object, which is a list containing:

`summary`	dataframe containing the input parameters, the number of genes and rules in the model, and the selected overall and by-class performance measurements. Each trial (i.e. set of parameters) is one row.
`confusionMatrix`	list of confusionMatrix objects generated by the caret package, which contain the full overall and by-class performance for each trial.
`errors`	list of errors generated by the trials.
`calls`	the call used to generate this object.
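
Once you have the summary element, choosing a parameter set is ordinary data frame work. The sketch below uses a mock table rather than real optimize_RF output: the values are invented, the column names mirror the example parameters and the default overall measurement, and representing a failed trial as NA is an assumption.

```r
# Mock summary table (invented values, for illustration only)
mock_summary <- data.frame(
  rules_altogether = c(2, 3, 10),
  num.trees        = c(100, 200, 300),
  Accuracy         = c(0.61, NA, 0.74))   # NA: assumed failed trial

# Rank trials by accuracy, pushing trials without a value to the end
ranked <- mock_summary[order(mock_summary$Accuracy,
                             decreasing = TRUE, na.last = TRUE), ]

# Parameter set with the highest accuracy
best <- ranked[1, ]
print(best)
```

Ranking the whole table, instead of taking only the maximum, lets you compare near-ties and prefer the smaller model when accuracies are close.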

Author(s)

Nour-al-dain Marzouka <nour-al-dain.marzouka at med.lu.se>

Examples

# generate random data
Data <- matrix(runif(8000), nrow=100, ncol=80,
               dimnames = list(paste0("G",1:100), paste0("S",1:80)))

# generate random labels
L <- sample(x = c("A","B","C","D"), size = 80, replace = TRUE)

# generate random platform labels
P <- sample(c("P1","P2","P3"), size = 80, replace = TRUE)

# create data object
object <- ReadData(Data = Data,
                   Labels = L,
                   Platform = P,
                   verbose = FALSE)

# sort genes
genes_RF <- sort_genes_RF(data_object = object,
                          seed=123456, verbose = FALSE)

# to get an idea of how many genes we will use
# and how many rules will be generated
# summary_genes_RF(sorted_genes_RF = genes_RF,
#                  genes_altogether = c(10,20,50,100,150,200),
#                  genes_one_vs_rest = c(10,20,50,100,150,200))

# create and sort rules
# rules_RF <- sort_rules_RF(data_object = object,
#                           sorted_genes_RF = genes_RF,
#                           genes_altogether = 100,
#                           genes_one_vs_rest = 100,
#                           seed=123456,
#                           verbose = FALSE)

# parameters <- data.frame(
#   gene_repetition=c(3,2,1),
#   rules_one_vs_rest=0,
#   rules_altogether=c(2,3,10),
#   run_boruta=c(FALSE,"produce_error",FALSE), # second value is intentionally invalid, so that trial is expected to fail
#   plot_boruta = FALSE,
#   num.trees=c(100,200,300),
#   stringsAsFactors = FALSE)
# parameters

# Or you can use expand.grid to generate dataframe with all parameter combinations
# parameters <- expand.grid(
#   gene_repetition=c(3,2,1),
#   rules_one_vs_rest=0,
#   rules_altogether=c(2,3,10),
#   num.trees=c(100,500,1000),
#   stringsAsFactors = FALSE)
# parameters


# test <- optimize_RF(data_object = object,
#                     sorted_rules_RF = rules_RF,
#                     test_object = NULL,
#                     overall = c("Accuracy"),
#                     byclass = NULL, verbose = FALSE,
#                     parameters = parameters)
# test
# test$summary[which.max(test$summary$Accuracy),]
#
# # train the final model
# # it is preferred to increase the number of trees and rules in case you have
# # large number of samples and features
# # for quick example, we have small number of trees and rules here
# # based on the optimize_RF results we will select the parameters
# RF_classifier <- train_RF(data_object = object,
#                           gene_repetition = 1,
#                           rules_altogether = 0,
#                           rules_one_vs_rest = 10,
#                           run_boruta = FALSE,
#                           plot_boruta = FALSE,
#                           probability = TRUE,
#                           num.trees = 300,
#                           sorted_rules_RF = rules_RF,
#                           boruta_args = list(),
#                           verbose = TRUE)
#
# # training accuracy
# # get the prediction labels
# # if the classifier was trained using probability = FALSE
# training_pred <- RF_classifier$RF_scheme$RF_classifier$predictions
# if (is.factor(training_pred)) {
#   x <- as.character(training_pred)
# }
#
# # if the classifier was trained using probability = TRUE
# if (is.matrix(training_pred)) {
#   x <- colnames(training_pred)[max.col(training_pred)]
# }
#
# # training accuracy
# caret::confusionMatrix(data =factor(x),
#                 reference = factor(object$data$Labels),
#                 mode = "everything")

# not to run
# visualize the binary rules in training dataset
# plot_binary_RF(Data = object,
#                classifier = RF_classifier,
#                prediction = NULL, as_training = TRUE,
#                show_scores = TRUE,
#                top_anno = "ref",
#                show_predictions = TRUE,
#                title = "Training data")

# not to run
# Extract and plot the proximity matrix from the classifier for the training data
# it takes long time for large data
# proximity_mat <- proximity_matrix_RF(object = object,
#                       classifier = RF_classifier,
#                       plot=TRUE,
#                       return_matrix=TRUE,
#                       title = "Test",
#                       cluster_cols = TRUE)

# not to run
# predict
# test_object # any test data
# results <- predict_RF(classifier = RF_classifier, impute = TRUE,
#                       Data = test_object)
#
# # visualize the binary rules in the test dataset
# plot_binary_RF(Data = test_object,
#                classifier = RF_classifier,
#                prediction = results, as_training = FALSE,
#                show_scores = TRUE,
#                top_anno = "ref",
#                show_predictions = TRUE,
#                title = "Test data")

[Package multiclassPairs version 0.4.3 Index]