rfTrain {MSiP} | R Documentation |
rfTrain
Description
The labeled feature matrix can be used as input for a Random Forest (RF) classifier. The classifier assigns each bait-prey pair a confidence score indicating the level of support for that pair of proteins to interact. Hyperparameter optimization can also be performed to select the set of parameters that maximizes the model's performance. This function also computes the areas under the precision-recall (PR) and ROC curves to evaluate the performance of the classifier.
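As an illustration of the ROC evaluation mentioned above, the following is a minimal base-R sketch of how an area under the ROC curve can be computed from scores and labels using the rank-sum identity. The `scores` and `labels` vectors and the `roc_auc` helper are hypothetical toy examples, not part of MSiP, and this is not necessarily how `rfTrain` computes the value internally.

```r
# Toy data (hypothetical): classifier scores and true labels (1 = interacting pair).
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(1, 1, 1, 0, 0, 0)

# ROC AUC via the rank-sum (Mann-Whitney) identity: the fraction of
# positive-negative pairs in which the positive instance scores higher.
roc_auc <- function(scores, labels) {
  r <- rank(scores)                 # average ranks are used for ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- roc_auc(scores, labels)
print(auc)
```

In this toy case 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.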
Usage
rfTrain(
  dtInput,
  impute = TRUE,
  p = 0.3,
  parameterTuning = TRUE,
  mtry = seq(from = 1, to = 10, by = 2),
  min_node_size = seq(from = 1, to = 9, by = 2),
  splitrule = c("gini"),
  metric = "Accuracy",
  resampling.method = "repeatedcv",
  iter = 5,
  repeats = 5,
  pr.plot = TRUE,
  roc.plot = TRUE
)
Arguments
dtInput |
Data frame containing instances with class labels |
impute |
Logical value, indicating whether to impute missing values |
p |
The proportion of the data (a value between 0 and 1) that goes to training; defaults to 0.3 |
parameterTuning |
Logical value, indicating whether to tune the RF hyperparameters |
mtry |
Number of variables to possibly split at in each node; bounded above by the number of variables in the model |
min_node_size |
Minimal node size |
splitrule |
Splitting rule for classification: 'gini', 'extratrees' or 'hellinger'; defaults to 'gini' |
metric |
A string that specifies which summary metric is used to select the optimal model; defaults to Accuracy |
resampling.method |
The resampling method: 'boot', 'boot632', 'optimism_boot', 'boot_all', 'cv', 'repeatedcv', 'LOOCV' or 'LGOCV'; defaults to 'repeatedcv' |
iter |
Number of resampling iterations; defaults to 5 |
repeats |
Number of repeats; used for repeated k-fold cross-validation only; defaults to 5 |
pr.plot |
Logical value, indicating whether to plot the precision-recall (PR) curve |
roc.plot |
Logical value, indicating whether to plot the ROC curve |
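To give a sense of the search space implied by the default `mtry`, `min_node_size` and `splitrule` values, the sketch below enumerates their combinations in base R. It assumes that every combination is evaluated under each resampling iteration and repeat when `parameterTuning = TRUE`; that is an assumption about the tuning procedure, not something this page states, and the `grid` and `n_fits` names are illustrative only.

```r
# Default candidate values, copied from the Usage section.
mtry <- seq(from = 1, to = 10, by = 2)          # 1, 3, 5, 7, 9
min_node_size <- seq(from = 1, to = 9, by = 2)  # 1, 3, 5, 7, 9
splitrule <- c("gini")

# All hyperparameter combinations (one row per candidate model).
grid <- expand.grid(
  mtry = mtry,
  min_node_size = min_node_size,
  splitrule = splitrule,
  stringsAsFactors = FALSE
)
print(nrow(grid))

# Assumed cost under the defaults iter = 5 and repeats = 5:
# each candidate would be fitted once per fold and repeat.
iter <- 5
repeats <- 5
n_fits <- nrow(grid) * iter * repeats
print(n_fits)
```

Under that assumption the defaults amount to 25 candidates and 625 model fits, so narrowing `mtry` or `min_node_size`, or lowering `iter` and `repeats`, is the simplest way to shorten a tuning run.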
Value
Data frame containing classification results for all instances in the data set. The positive confidence score corresponds to the level of support for the pair of proteins being a true positive, whereas the negative score corresponds to the level of support for the pair being a true negative.
Author(s)
Matineh Rahmatbakhsh, matinerb.94@gmail.com
Examples
library(MSiP)
data(testdfClassifier)
# Train on 30% of the data without hyperparameter tuning; plot the PR curve only
predicted_RF <-
  rfTrain(testdfClassifier, impute = FALSE, p = 0.3, parameterTuning = FALSE,
          mtry = seq(from = 1, to = 5, by = 1),
          min_node_size = seq(from = 1, to = 5, by = 1),
          splitrule = c("gini"), metric = "Accuracy",
          resampling.method = "cv", iter = 2, repeats = 2,
          pr.plot = TRUE, roc.plot = FALSE)
# Inspect the first few scored bait-prey pairs
head(predicted_RF)