mlRforest {mlearning} | R Documentation |
Supervised classification and regression using random forest
Description
Unified (formula-based) interface version of the random forest algorithm
provided by randomForest::randomForest()
.
Usage
mlRforest(train, ...)
ml_rforest(train, ...)
## S3 method for class 'formula'
mlRforest(
formula,
data,
ntree = 500,
mtry,
replace = TRUE,
classwt = NULL,
...,
subset,
na.action
)
## Default S3 method:
mlRforest(
train,
response,
ntree = 500,
mtry,
replace = TRUE,
classwt = NULL,
...
)
## S3 method for class 'mlRforest'
predict(
object,
newdata,
type = c("class", "membership", "both", "vote"),
method = c("direct", "oob", "cv"),
...
)
Arguments
train |
a matrix or data frame with predictors. |
... |
further arguments passed to |
formula |
a formula with left term being the factor variable to predict
(for supervised classification), a vector of numbers (for regression) or
nothing (for unsupervised classification) and the right term with the list
of independent, predictive variables, separated with a plus sign. If the
data frame provided contains only the dependent and independent variables,
one can use the |
data |
a data.frame to use as a training set. |
ntree |
the number of trees to generate (use a value large enough to get at least a few predictions for each input row). Default is 500 trees. |
mtry |
number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)? |
replace |
sample cases with or without replacement ( |
classwt |
priors of the classes. Need not add up to one. Ignored for regression. |
subset |
index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action |
function to specify the action to be taken if |
response |
a vector of factor (classification) or numeric (regression),
or |
object |
an mlRforest object |
newdata |
a new dataset with same conformation as the training set (same variables, except may by the class for classification or dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type |
the type of prediction to return. |
method |
|
Value
ml_rforest()
/mlRforest()
creates an mlRforest, mlearning
object containing the classifier and a lot of additional metadata used by
the functions and methods you can apply to it like predict()
or
cvpredict()
. In case you want to program new functions or extract
specific components, inspect the "unclassed" object using unclass()
.
See Also
mlearning()
, cvpredict()
, confusion()
, also
randomForest::randomForest()
that actually does the classification.
Examples
# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA
iris_rf <- ml_rforest(data = iris_train, Species ~ .)
summary(iris_rf)
plot(iris_rf) # Useful to look at the effect of ntree=
# For such a relatively simple case, 50 trees are enough
iris_rf <- ml_rforest(data = iris_train, Species ~ ., ntree = 50)
summary(iris_rf)
predict(iris_rf) # Default type is class
predict(iris_rf, type = "membership")
predict(iris_rf, type = "both")
predict(iris_rf, type = "vote")
# Out-of-bag prediction (unbiased)
predict(iris_rf, method = "oob")
# Self-consistency (always very high for random forest, biased, do not use!)
confusion(iris_rf)
# This one is better
confusion(iris_rf, method = "oob") # Out-of-bag performances
# Cross-validation prediction is also a good choice when there is no test set
predict(iris_rf, method = "cv") # Idem: cvpredict(res)
# Cross-validation for performances estimation
confusion(iris_rf, method = "cv")
# Evaluation of performances using a separate test set
confusion(predict(iris_rf, newdata = iris_test), iris_test$Species)
# Regression using random forest (from ?randomForest)
set.seed(131) # Useful for reproducibility (use a different number each time)
ozone_rf <- ml_rforest(data = airquality, Ozone ~ ., mtry = 3,
importance = TRUE, na.action = na.omit)
summary(ozone_rf)
# Show "importance" of variables: higher value mean more important variables
round(randomForest::importance(ozone_rf), 2)
plot(na.omit(airquality)$Ozone, predict(ozone_rf))
abline(a = 0, b = 1)
# Unsupervised classification using random forest (from ?randomForest)
set.seed(17)
iris_urf <- ml_rforest(train = iris[, -5]) # Use only quantitative data
summary(iris_urf)
randomForest::MDSplot(iris_urf, iris$Species)
plot(stats::hclust(stats::as.dist(1 - iris_urf$proximity),
method = "average"), labels = iris$Species)