mlearning {mlearning} | R Documentation |
Machine learning model for (un)supervised classification or regression
Description
An mlearning object provides a unified (formula-based) interface to
several machine learning algorithms. They share the same interface and very
similar arguments. They conform to the formula-based approach of, say,
stats::lm() in base R, but with a coherent handling of missing data and
missing class levels. An optimized version exists for the simplified y ~ .
formula. Finally, cross-validation is also built in.
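As a minimal sketch of what the shared interface looks like in practice, the snippet below fits two different classifiers with the same call shape. It assumes the mlearning package (and the packages it depends on for these two methods) is installed; the object names iris_lda and iris_rf are chosen for illustration only.

```r
library(mlearning)

data("iris", package = "datasets")

# Same formula and same 'data' argument for two different algorithms
iris_lda <- ml_lda(data = iris, Species ~ .)
iris_rf  <- ml_rforest(data = iris, Species ~ .)

# Both results are mlearning objects, so the same methods apply to each
inherits(iris_lda, "mlearning")
inherits(iris_rf, "mlearning")
summary(iris_lda)
summary(iris_rf)
```

Only the function name changes between the two fits; method-specific settings (such as ntree for Random Forest) are passed through the ... argument.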
Usage
mlearning(
formula,
data,
method,
model.args,
call = match.call(),
...,
subset,
na.action = na.fail
)
## S3 method for class 'mlearning'
print(x, ...)
## S3 method for class 'mlearning'
summary(object, ...)
## S3 method for class 'summary.mlearning'
print(x, ...)
## S3 method for class 'mlearning'
plot(x, y, ...)
## S3 method for class 'mlearning'
predict(
object,
newdata,
type = c("class", "membership", "both"),
method = c("direct", "cv"),
na.action = na.exclude,
...
)
cvpredict(object, ...)
## S3 method for class 'mlearning'
cvpredict(
object,
type = c("class", "membership", "both"),
cv.k = 10,
cv.strat = TRUE,
...
)
Arguments
formula |
a formula with left term being the factor variable to predict
(for supervised classification), a vector of numbers (for regression) or
nothing (for unsupervised classification) and the right term with the list
of independent, predictive variables, separated with a plus sign. If the
data frame provided contains only the dependent and independent variables,
one can use the |
data |
a data.frame to use as a training set. |
method |
|
model.args |
arguments for formula modeling with substituted data and subset... Not to be used by the end-user. |
call |
the function call. Not to be used by the end-user. |
... |
further arguments (depends on the method). |
subset |
index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action |
function to specify the action to be taken if |
x , object |
an mlearning object |
y |
a second mlearning object or nothing (not used in several plots) |
newdata |
a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type |
the type of prediction to return. |
cv.k |
k for k-fold cross-validation, cf |
cv.strat |
is the subsampling stratified or not in cross-validation,
cf |
Value
An mlearning object for mlearning(). Methods return their own
results, which can be an mlearning object, a data.frame, a vector, etc.
See Also
ml_lda(), ml_qda(), ml_naive_bayes(), ml_nnet(), ml_rpart(), ml_rforest(),
ml_svm(), confusion() and prior(). Also ipred::errorest(), which internally
computes the cross-validation in cvpredict().
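The cross-validation arguments listed under Usage can be set explicitly through cvpredict(). The following is a sketch only, assuming mlearning is installed and that the training data is still available in the calling environment; the choice of 5 folds and the names iris_lda and iris_cv are illustrative, not recommendations.

```r
library(mlearning)

data("iris", package = "datasets")
set.seed(42)  # Folds are drawn at random, so fix the seed for repeatability

iris_lda <- ml_lda(data = iris, Species ~ .)

# 5-fold stratified cross-validation instead of the default 10 folds
iris_cv <- cvpredict(iris_lda, type = "class", cv.k = 5, cv.strat = TRUE)

# Cross-tabulate observed classes against cross-validated predictions
table(observed = response(iris_lda), predicted = iris_cv)
```

Because every case is predicted by a model that did not see it during training, this table gives a less optimistic picture than predictions made directly on the training set.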
Examples
# mlearning() should not be called directly. Use the mlXXX() functions instead
# for instance, for Random Forest, use ml_rforest()/mlRforest()
# A typical classification involves several steps:
#
# 1) Prepare data: split into training set (2/3) and test set (1/3)
# Data cleaning (elimination of unwanted variables), transformation of
# others (scaling, log, ratios, numeric to factor, ...) may be necessary
# here. Apply the same treatments on the training and test sets
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133) # Random or stratified sampling is also possible
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# 2) Train the classifier, use of the simplified formula class ~ . encouraged
# so, you may have to prepare the train/test sets to keep only relevant
# variables and to possibly transform them before use
iris_rf <- ml_rforest(data = iris_train, Species ~ .)
iris_rf
summary(iris_rf)
train(iris_rf)
response(iris_rf)
# 3) Find optimal values for the parameters of the model
# This is usually done iteratively. Just an example with ntree, where a plot
# exists to help find the optimal value
plot(iris_rf)
# For such a relatively simple case, 50 trees are enough, retrain with it
iris_rf <- ml_rforest(data = iris_train, Species ~ ., ntree = 50)
summary(iris_rf)
# 4) Study the classifier performances. Several metrics and tools exist
# like ROC curves, AUC, etc. Tools provided here are the confusion matrix
# and the metrics that are calculated on it.
predict(iris_rf) # Default type is class
predict(iris_rf, type = "membership")
predict(iris_rf, type = "both")
# Confusion matrix and metrics using 10-fold cross-validation
iris_rf_conf <- confusion(iris_rf, method = "cv")
iris_rf_conf
summary(iris_rf_conf)
# Note you may want to manipulate priors too, see ?prior
# 5) Go back to step #1 and refine the process until you are happy with the
# results. Then, you can use the classifier to predict unknown items.
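The examples above build iris_test but never predict it. As a hedged sketch of that last step, the snippet below applies the trained classifier to the held-out test set through the newdata argument of predict() and compares the result with the known species. It assumes mlearning is installed; test_pred is an illustrative name, and base table() is used here for the cross-tabulation rather than any package-specific tool.

```r
library(mlearning)

data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]

set.seed(123)  # Random Forest is stochastic; fix the seed for repeatability
iris_rf <- ml_rforest(data = iris_train, Species ~ ., ntree = 50)

# Predict the classes of items that were not used for training
test_pred <- predict(iris_rf, newdata = iris_test)

# Compare predictions with the known classes of the test set
table(observed = iris_test$Species, predicted = test_pred)
mean(test_pred == iris_test$Species)  # Proportion of correct predictions
```

For genuinely unknown items there is no observed class to compare with, so only the predict() call applies.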