sort_rules_RF {multiclassPairs} | R Documentation |
Create and sort feature/gene pairs for pair-based random forest classifier training step
Description
sort_rules_RF uses
random forest to create and sort gene/feature pairs prior to the downstream random forest model training step.
Usage
sort_rules_RF(data_object,
sorted_genes_RF,
genes_altogether = 50,
genes_one_vs_rest = 50,
run_altogether = TRUE,
run_one_vs_rest = TRUE,
platform_wise = FALSE,
num.trees = 500,
min.node.size = 1,
importance = "impurity",
write.forest = FALSE,
keep.inbag = FALSE,
verbose = TRUE, ...)
Arguments
data_object |
data object generated by ReadData function. Object contains the data and labels. |
sorted_genes_RF |
RandomForest_sorted_genes object created by sort_genes_RF function |
genes_altogether |
integer indicating how many top genes/features from the altogether sorting should be used. These genes are pooled with the genes selected through genes_one_vs_rest to create all possible rules. Default is 50. |
genes_one_vs_rest |
integer indicating how many top genes/features from the one-vs-rest sorting should be used. These genes are pooled with the genes selected through genes_altogether to create all possible rules. Default is 50. |
run_altogether |
logical indicating whether the altogether RF model should be run to sort the rules based on their importance across all classes together. Default is TRUE. |
run_one_vs_rest |
logical indicating whether a one_vs_rest RF model should be run for each class to sort the rules based on their importance in each class. Default is TRUE. |
platform_wise |
logical indicating whether rule importance should be calculated in each platform separately and then combined based on the lowest importance value (i.e. a rule with low importance in any platform will not be prioritized). Default is FALSE. See Details for more description. |
num.trees |
an integer. Number of trees. Default is 500. Increasing num.trees is recommended when there is a large number of features (ranger function argument). |
min.node.size |
an integer. Minimal node size. Default is 1. (ranger function argument) |
importance |
Variable importance mode, should be one of 'impurity', 'impurity_corrected', 'permutation'. Default is 'impurity' (ranger function argument) |
write.forest |
Save ranger.forest object, required for prediction. Default is FALSE to reduce memory. (ranger function argument) |
keep.inbag |
Save how often observations are in-bag in each tree. Default is FALSE. (ranger function argument) |
verbose |
a logical value indicating whether processing messages will be printed or not. Default is TRUE. |
... |
any additional arguments to be passed to ranger function (i.e. random forest function) in ranger package. For example, seed for reproducibility. |
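The pooling described for genes_altogether and genes_one_vs_rest can be illustrated with a rough base R sketch. The gene lists below are hypothetical, and the sketch assumes that each unordered pair of pooled genes yields one candidate rule; the "__" separator in the rule names is also an assumption for illustration, not a statement about the package internals.

```r
# Hypothetical top-gene lists; in practice these come from sort_genes_RF()
top_altogether  <- c("G1", "G2", "G3", "G4")
top_one_vs_rest <- list(A = c("G2", "G5"), B = c("G6", "G1"))

# Pool the unique genes selected from both sortings
pooled_genes <- unique(c(top_altogether,
                         unlist(top_one_vs_rest, use.names = FALSE)))

# Assumption: every unordered pair of pooled genes is one candidate rule
pairs <- t(combn(pooled_genes, 2))
candidate_rules <- paste(pairs[, 1], pairs[, 2], sep = "__")

length(pooled_genes)     # number of pooled genes
length(candidate_rules)  # equals choose(length(pooled_genes), 2)
```

Because the number of candidate rules grows quadratically with the size of the gene pool, moderate increases in genes_altogether or genes_one_vs_rest can noticeably increase run time.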
Details
In case of class imbalance, run_one_vs_rest=TRUE is recommended.
Platform-wise option: when platform_wise=TRUE, rule importance is computed in each platform separately and then combined. For example, if the data has three platforms (P1, P2, and P3) and random forest is run for class 1 (C1) versus rest in each platform separately, C1 gets three importance lists, one per platform. Rules are ranked within each list (a lower rank number means higher importance), and the final combined sorting is determined by each rule's worst (numerically highest) rank across the lists. A rule with ranks (5,5,5) is therefore prioritized over a rule with ranks (1,1,6). This is applied to both the altogether sorting and the one-vs-rest sorting. Other combining methods could be added in the future.
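The platform-wise combination can be sketched in base R as follows. The rule names and rank values are hypothetical, and the code only illustrates the worst-rank idea described above; how the package handles ties or computes the ranks internally is not covered here.

```r
# Hypothetical ranks of four rules in three platforms (1 = most important)
ranks <- rbind(
  rule_a = c(P1 = 5, P2 = 5, P3 = 5),
  rule_b = c(P1 = 1, P2 = 1, P3 = 6),
  rule_c = c(P1 = 2, P2 = 3, P3 = 1),
  rule_d = c(P1 = 7, P2 = 2, P3 = 2)
)

# For each rule, keep its worst (numerically highest) rank across platforms
worst_rank <- apply(ranks, 1, max)

# Rules with a smaller worst rank come first in the combined sorting
combined_order <- names(sort(worst_rank))
combined_order
```

In this sketch rule_a, with ranks (5,5,5), ends up ahead of rule_b, with ranks (1,1,6), because a single poor rank in any platform pushes a rule down the combined list.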
Value
Returns a RandomForest_sorted_rules object, which contains the rules sorted by importance in each class (one-vs-rest sorting) and by the altogether sorting. It also contains the random forest objects that were used in the sorting.
Author(s)
Nour-al-dain Marzouka <nour-al-dain.marzouka at med.lu.se>
Examples
# generate random data
Data <- matrix(runif(8000), nrow=100, ncol=80,
dimnames = list(paste0("G",1:100), paste0("S",1:80)))
# generate random labels
L <- sample(x = c("A","B","C","D"), size = 80, replace = TRUE)
# generate random platform labels
P <- sample(c("P1","P2","P3"), size = 80, replace = TRUE)
# create data object
object <- ReadData(Data = Data,
Labels = L,
Platform = P,
verbose = FALSE)
# sort genes
genes_RF <- sort_genes_RF(data_object = object,
seed=123456, verbose = FALSE)
# to get an idea of how many genes we will use
# and how many rules will be generated
# summary_genes_RF(sorted_genes_RF = genes_RF,
# genes_altogether = c(10,20,50,100,150,200),
# genes_one_vs_rest = c(10,20,50,100,150,200))
# create and sort rules
# rules_RF <- sort_rules_RF(data_object = object,
# sorted_genes_RF = genes_RF,
# genes_altogether = 100,
# genes_one_vs_rest = 100,
# seed=123456,
# verbose = FALSE)
# parameters <- data.frame(
# gene_repetition=c(3,2,1),
# rules_one_vs_rest=0,
# rules_altogether=c(2,3,10),
# run_boruta=c(FALSE,"produce_error",FALSE),
# plot_boruta = FALSE,
# num.trees=c(100,200,300),
# stringsAsFactors = FALSE)
# parameters
# Or you can use expand.grid to generate dataframe with all parameter combinations
# parameters <- expand.grid(
# gene_repetition=c(3,2,1),
# rules_one_vs_rest=0,
# rules_altogether=c(2,3,10),
# num.trees=c(100,500,1000),
# stringsAsFactors = FALSE)
# parameters
# test <- optimize_RF(data_object = object,
# sorted_rules_RF = rules_RF,
# test_object = NULL,
# overall = c("Accuracy"),
# byclass = NULL, verbose = FALSE,
# parameters = parameters)
# test
# test$summary[which.max(test$summary$Accuracy),]
#
# # train the final model
# # it is preferred to increase the number of trees and rules in case you have
# # large number of samples and features
# # for quick example, we have small number of trees and rules here
# # based on the optimize_RF results we will select the parameters
# RF_classifier <- train_RF(data_object = object,
# gene_repetition = 1,
# rules_altogether = 0,
# rules_one_vs_rest = 10,
# run_boruta = FALSE,
# plot_boruta = FALSE,
# probability = TRUE,
# num.trees = 300,
# sorted_rules_RF = rules_RF,
# boruta_args = list(),
# verbose = TRUE)
#
# # training accuracy
# # get the prediction labels
# # if the classifier trained using probability = FALSE
# training_pred <- RF_classifier$RF_scheme$RF_classifier$predictions
# if (is.factor(training_pred)) {
# x <- as.character(training_pred)
# }
#
# # if the classifier trained using probability = TRUE
# if (is.matrix(training_pred)) {
# x <- colnames(training_pred)[max.col(training_pred)]
# }
#
# # training accuracy
# caret::confusionMatrix(data =factor(x),
# reference = factor(object$data$Labels),
# mode = "everything")
# not to run
# visualize the binary rules in training dataset
# plot_binary_RF(Data = object,
# classifier = RF_classifier,
# prediction = NULL, as_training = TRUE,
# show_scores = TRUE,
# top_anno = "ref",
# show_predictions = TRUE,
# title = "Training data")
# not to run
# Extract and plot the proximity matrix from the classifier for the training data
# it takes long time for large data
# proximity_mat <- proximity_matrix_RF(object = object,
# classifier = RF_classifier,
# plot=TRUE,
# return_matrix=TRUE,
# title = "Test",
# cluster_cols = TRUE)
# not to run
# predict
# test_object # any test data
# results <- predict_RF(classifier = RF_classifier, impute = TRUE,
# Data = test_object)
#
# # visualize the binary rules in test dataset
# plot_binary_RF(Data = test_object,
# classifier = RF_classifier,
# prediction = results, as_training = FALSE,
# show_scores = TRUE,
# top_anno = "ref",
# show_predictions = TRUE,
# title = "Test data")