sort_genes_RF {multiclassPairs} | R Documentation |
Sort genes/features for pair-based random forest classifier downstream steps
Description
sort_genes_RF
uses random forest to sort genes/features prior to downstream steps, such as the selection of gene pairs/rules, which also involve random forest models.
Usage
sort_genes_RF(data_object,
featureNo_altogether,
featureNo_one_vs_rest,
rank_data = FALSE,
platform_wise = FALSE,
num.trees = 500,
min.node.size = 1,
importance = "impurity",
write.forest = FALSE,
keep.inbag = FALSE,
verbose = TRUE, ...)
Arguments
data_object |
data object generated by the ReadData function. The object contains the data and labels. |
featureNo_altogether |
an integer. Optional. The number of top sorted genes to return from the single random forest model that includes all the labels together. If 0, this sorting is skipped. By default, if no number is specified, all available genes are sorted and returned, because the user can specify how many top genes to use in the downstream analysis. |
featureNo_one_vs_rest |
an integer. Optional. The number of top sorted genes to return from the 'one vs rest' random forest models. Each class gets its own random forest in which the samples from the other classes are labelled as 'rest'. If 0, this sorting is skipped. By default, if no number is specified, all available genes are sorted and returned, because the user can specify how many top genes to use in the downstream analysis. |
rank_data |
logical indicating whether the data should be ranked (features are ranked within each sample). Default is FALSE. |
platform_wise |
logical indicating whether the gene importance should be calculated in each platform separately and then combined based on the lowest importance value (i.e. a gene with low importance in any platform will not be prioritized). Default is FALSE. See Details for more description. |
num.trees |
an integer. Number of trees. Default is 500. It is recommended to increase num.trees when the number of features is large (ranger function argument). |
min.node.size |
an integer. Minimal node size. Default is 1. (ranger function argument) |
importance |
Variable importance mode, should be one of 'impurity', 'impurity_corrected', 'permutation'. Default is 'impurity' (ranger function argument) |
write.forest |
Save ranger.forest object, required for prediction. Default is FALSE to reduce memory. (ranger function argument) |
keep.inbag |
Save how often observations are in-bag in each tree. Default is FALSE. (ranger function argument) |
verbose |
a logical value indicating whether processing messages will be printed or not. Default is TRUE. |
... |
any additional arguments to be passed to ranger function (i.e. random forest function) in ranger package. For example, seed for reproducibility. |
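The rank_data argument above ranks features within each sample. A minimal base R sketch of that idea follows; the toy matrix and the use of rank() with its default tie handling are assumptions made for illustration, and the package's internal ranking call may differ.

```r
# Toy expression matrix (hypothetical): genes in rows, samples in columns
set.seed(1)
toy <- matrix(runif(12), nrow = 4, ncol = 3,
              dimnames = list(paste0("G", 1:4), paste0("S", 1:3)))

# Rank the features inside each sample, i.e. column by column.
# Each column of 'ranked' is then a permutation of 1..4 (no ties expected here).
ranked <- apply(toy, 2, rank)
ranked
```

Ranking within a sample discards the absolute scale of the values and keeps only their order, which can help when samples come from platforms with different value ranges.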
Details
For the platform-wise option: when platform_wise=TRUE and the data has, for example, three platforms (P1, P2, and P3), the random forest for class 1 (C1) versus rest is run in each platform separately. C1 then has three importance lists, one per platform, and the genes are sorted and ranked within each list (a lower rank number means higher importance). The final combined sorting is determined by each gene's lowest importance level (i.e. its worst rank) across the lists, so a gene with ranks (5,5,5) is prioritized over a gene with ranks (1,1,6). This is applied to both the altogether sorting and the one-vs-rest sorting. Other combining methods could be added in the future.
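The combining rule described above can be sketched in base R as follows. This is an illustration only, not the package's internal code: the gene names and rank values are hypothetical, and how ties are broken is not stated in this documentation.

```r
# Hypothetical per-platform ranks for class C1 (1 = most important)
ranks <- rbind(geneA = c(P1 = 5, P2 = 5, P3 = 5),
               geneB = c(P1 = 1, P2 = 1, P3 = 6),
               geneC = c(P1 = 2, P2 = 3, P3 = 4))

# A gene's combined score is its worst (largest) rank across platforms
worst_rank <- apply(ranks, 1, max)

# Sort genes so that the smallest worst rank comes first
combined_order <- names(sort(worst_rank))
combined_order
# "geneC" "geneA" "geneB"
```

Here geneA (5,5,5) is placed ahead of geneB (1,1,6), because geneB's rank of 6 on P3 is its deciding value.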
Value
Returns a RandomForest_sorted_genes object, which contains the genes sorted by importance from the per-class (one-vs-rest) sorting and from the altogether sorting. It also contains the random forest objects used in the sorting.
Author(s)
Nour-al-dain Marzouka <nour-al-dain.marzouka at med.lu.se>
Examples
# generate random data
Data <- matrix(runif(8000), nrow=100, ncol=80,
dimnames = list(paste0("G",1:100), paste0("S",1:80)))
# generate random labels
L <- sample(x = c("A","B","C","D"), size = 80, replace = TRUE)
# generate random platform labels
P <- sample(c("P1","P2","P3"), size = 80, replace = TRUE)
# create data object
object <- ReadData(Data = Data,
Labels = L,
Platform = P,
verbose = FALSE)
# sort genes
genes_RF <- sort_genes_RF(data_object = object,
seed=123456, verbose = FALSE)
# to get an idea of how many genes we will use
# and how many rules will be generated
# summary_genes_RF(sorted_genes_RF = genes_RF,
# genes_altogether = c(10,20,50,100,150,200),
# genes_one_vs_rest = c(10,20,50,100,150,200))
# create and sort rules
# rules_RF <- sort_rules_RF(data_object = object,
# sorted_genes_RF = genes_RF,
# genes_altogether = 100,
# genes_one_vs_rest = 100,
# seed=123456,
# verbose = FALSE)
# parameters <- data.frame(
# gene_repetition=c(3,2,1),
# rules_one_vs_rest=0,
# rules_altogether=c(2,3,10),
# run_boruta=c(FALSE,"produce_error",FALSE),
# plot_boruta = FALSE,
# num.trees=c(100,200,300),
# stringsAsFactors = FALSE)
# parameters
# Or you can use expand.grid to generate dataframe with all parameter combinations
# parameters <- expand.grid(
# gene_repetition=c(3,2,1),
# rules_one_vs_rest=0,
# rules_altogether=c(2,3,10),
# num.trees=c(100,500,1000),
# stringsAsFactors = FALSE)
# parameters
# test <- optimize_RF(data_object = object,
# sorted_rules_RF = rules_RF,
# test_object = NULL,
# overall = c("Accuracy"),
# byclass = NULL, verbose = FALSE,
# parameters = parameters)
# test
# test$summary[which.max(test$summary$Accuracy),]
#
# # train the final model
# # it is preferred to increase the number of trees and rules in case you have
# # large number of samples and features
# # for quick example, we have small number of trees and rules here
# # based on the optimize_RF results we will select the parameters
# RF_classifier <- train_RF(data_object = object,
# gene_repetition = 1,
# rules_altogether = 0,
# rules_one_vs_rest = 10,
# run_boruta = FALSE,
# plot_boruta = FALSE,
# probability = TRUE,
# num.trees = 300,
# sorted_rules_RF = rules_RF,
# boruta_args = list(),
# verbose = TRUE)
#
# # training accuracy
# # get the prediction labels
# # if the classifier trained using probability = FALSE
# training_pred <- RF_classifier$RF_scheme$RF_classifier$predictions
# if (is.factor(training_pred)) {
# x <- as.character(training_pred)
# }
#
# # if the classifier trained using probability = TRUE
# if (is.matrix(training_pred)) {
# x <- colnames(training_pred)[max.col(training_pred)]
# }
#
# # training accuracy
# caret::confusionMatrix(data =factor(x),
# reference = factor(object$data$Labels),
# mode = "everything")
# not to run
# visualize the binary rules in training dataset
# plot_binary_RF(Data = object,
# classifier = RF_classifier,
# prediction = NULL, as_training = TRUE,
# show_scores = TRUE,
# top_anno = "ref",
# show_predictions = TRUE,
# title = "Training data")
# not to run
# Extract and plot the proximity matrix from the classifier for the training data
# it takes long time for large data
# proximity_mat <- proximity_matrix_RF(object = object,
# classifier = RF_classifier,
# plot=TRUE,
# return_matrix=TRUE,
# title = "Test",
# cluster_cols = TRUE)
# not to run
# predict
# test_object # any test data
# results <- predict_RF(classifier = RF_classifier, impute = TRUE,
# Data = test_object)
#
# # visualize the binary rules in test dataset
# plot_binary_RF(Data = test_object,
# classifier = RF_classifier,
# prediction = results, as_training = FALSE,
# show_scores = TRUE,
# top_anno = "ref",
# show_predictions = TRUE,
# title = "Test data")