summary_genes_RF {multiclassPairs} | R Documentation |
Summarize sorted genes to rules
Description
After sorting genes RF by sort_genes_RF
function summary_genes_RF
gives an idea of how many genes you need to use to generate specific number of rules in sort_rules_RF
function.
Usage
summary_genes_RF(sorted_genes_RF,
genes_altogether,
genes_one_vs_rest)
Arguments
sorted_genes_RF |
sorted genes object with class |
genes_altogether |
numeric vector indicating how many genes from altogether slot (i.e. 'all') should be used each time. |
genes_one_vs_rest |
numeric vector indicating how many genes from one_vs_rest slots (i.e. per class) should be used each time. |
Details
summary_genes_RF
function helps the user to know which number of genes should be used to get the needed number of rules in sort_rules_RF
function. NOTE: without consideration of gene replication in rules, because the rules are not sorted yet. summary_genes_RF
workes as follows: take the first element in genes_altogether
and genes_one_vs_rest
, then bring this number of top genes from altogether slot and one_vs_rest slots (this number of genes will be taken from each class), respectively, from the sorted_genes_RF
object. Then pool the extracted genes. Then take the unique genes. Then calculate the number of the possible combinations. Store the number of unique genes and rules in first row in the output dataframe then pick the second element from the genes_altogether
and genes_one_vs_rest
and repeat the steps again.
Value
returns a dataframe with the used paramerters and the expected number of unique genes and rules. Number of rows of the dataframe equals the length of genes_altogether
and genes_one_vs_rest
.
Author(s)
Nour-al-dain Marzouka <nour-al-dain.marzouka at med.lu.se>
Examples
# generate random data
Data <- matrix(runif(8000), nrow=100, ncol=80,
dimnames = list(paste0("G",1:100), paste0("S",1:80)))
# generate random labels
L <- sample(x = c("A","B","C","D"), size = 80, replace = TRUE)
# generate random platform labels
P <- sample(c("P1","P2","P3"), size = 80, replace = TRUE)
# create data object
object <- ReadData(Data = Data,
Labels = L,
Platform = P,
verbose = FALSE)
# sort genes
genes_RF <- sort_genes_RF(data_object = object,
seed=123456, verbose = FALSE)
# to get an idea of how many genes we will use
# and how many rules will be generated
# summary_genes_RF(sorted_genes_RF = genes_RF,
# genes_altogether = c(10,20,50,100,150,200),
# genes_one_vs_rest = c(10,20,50,100,150,200))
# creat and sort rules
# rules_RF <- sort_rules_RF(data_object = object,
# sorted_genes_RF = genes_RF,
# genes_altogether = 100,
# genes_one_vs_rest = 100,
# seed=123456,
# verbose = FALSE)
# parameters <- data.frame(
# gene_repetition=c(3,2,1),
# rules_one_vs_rest=0,
# rules_altogether=c(2,3,10),
# run_boruta=c(FALSE,"produce_error",FALSE),
# plot_boruta = FALSE,
# num.trees=c(100,200,300),
# stringsAsFactors = FALSE)
# parameters
# Or you can use expand.grid to generate dataframe with all parameter combinations
# parameters <- expand.grid(
# gene_repetition=c(3,2,1),
# rules_one_vs_rest=0,
# rules_altogether=c(2,3,10),
# num.trees=c(100,500,1000),
# stringsAsFactors = FALSE)
# parameters
# test <- optimize_RF(data_object = object,
# sorted_rules_RF = rules_RF,
# test_object = NULL,
# overall = c("Accuracy"),
# byclass = NULL, verbose = FALSE,
# parameters = parameters)
# test
# test$summary[which.max(test$summary$Accuracy),]
#
# # train the final model
# # it is preferred to increase the number of trees and rules in case you have
# # large number of samples and features
# # for quick example, we have small number of trees and rules here
# # based on the optimize_RF results we will select the parameters
# RF_classifier <- train_RF(data_object = object,
# gene_repetition = 1,
# rules_altogether = 0,
# rules_one_vs_rest = 10,
# run_boruta = FALSE,
# plot_boruta = FALSE,
# probability = TRUE,
# num.trees = 300,
# sorted_rules_RF = rules_RF,
# boruta_args = list(),
# verbose = TRUE)
#
# # training accuracy
# # get the prediction labels
# # if the classifier trained using probability = FALSE
# training_pred <- RF_classifier$RF_scheme$RF_classifier$predictions
# if (is.factor(training_pred)) {
# x <- as.character(training_pred)
# }
#
# # if the classifier trained using probability = TRUE
# if (is.matrix(training_pred)) {
# x <- colnames(training_pred)[max.col(training_pred)]
# }
#
# # training accuracy
# caret::confusionMatrix(data =factor(x),
# reference = factor(object$data$Labels),
# mode = "everything")
# not to run
# visualize the binary rules in training dataset
# plot_binary_RF(Data = object,
# classifier = RF_classifier,
# prediction = NULL, as_training = TRUE,
# show_scores = TRUE,
# top_anno = "ref",
# show_predictions = TRUE,
# title = "Training data")
# not to run
# Extract and plot the proximity matrix from the classifier for the training data
# it takes long time for large data
# proximity_mat <- proximity_matrix_RF(object = object,
# classifier = RF_classifier,
# plot=TRUE,
# return_matrix=TRUE,
# title = "Test",
# cluster_cols = TRUE)
# not to run
# predict
# test_object # any test data
# results <- predict_RF(classifier = RF_classifier, impute = TRUE,
# Data = test_object)
#
# # visualize the binary rules in training dataset
# plot_binary_RF(Data = test_object,
# classifier = RF_classifier,
# prediction = results, as_training = FALSE,
# show_scores = TRUE,
# top_anno = "ref",
# show_predictions = TRUE,
# title = "Test data")