mining {rminer}

R Documentation

Powerful function that trains and tests a particular fit model under several runs and a given validation method

Description

Powerful function that trains and tests a particular fit model under several runs and a given validation method. Since there can be a huge number of models, the fitted models are not stored. Yet, several useful statistics (e.g. predictions) are returned.

Usage

mining(x, data = NULL, Runs = 1, method = NULL, model = "default", 
       task = "default", search = "heuristic", mpar = NULL,
       feature="none", scale = "default", transform = "none", 
       debug = FALSE, ...)

Arguments

`x`	a symbolic description (formula) of the model to be fit. If `x` contains the data, then `data=NULL` (similar to x in `ksvm`, kernlab package).
`data`	an optional data frame (columns denote attributes, rows denote examples) containing the training data, when using a formula.
`Runs`	number of runs used (e.g. 1, 5, 10, 20, 30)
`method`	a vector with c(vmethod,vpar,seed) or c(vmethod,vpar,window,increment), where vmethod is one of:
`all` – all NROW examples are used as both training and test sets (no vpar or seed is needed).
`holdout` – standard holdout method. If vpar<1, then NROW*vpar random samples are used for training and the remaining rows are used for testing. Otherwise, vpar random samples are used for testing and the remaining rows are used for training. For classification tasks (`prob` or `class`) a stratified sampling is assumed (equal to `mode="stratified"` in `holdout`).
`holdoutrandom` – similar to `holdout`, except that it always assumes a random sampling (not stratified).
`holdoutorder` – similar to `holdout`, except that instead of a random sampling, the first rows (until the split) are used for training and the remaining ones for testing (equal to `mode="order"` in `holdout`).
`holdoutinc` – incremental holdout retraining (e.g. used for stock market data). Here, vpar is the test size, window is the initial window size and increment is the number of samples added at each iteration. Note: argument `Runs` is automatically set when this option is used. See also `holdout`.
`holdoutrol` – rolling holdout retraining (e.g. used for stock market data). Here, vpar is the test size, window is the window size and increment is the number of samples added at each iteration. Note: argument `Runs` is automatically set when this option is used. See also `holdout`.
`kfold` – K-fold cross-validation method, where vpar is the number of folds. For classification tasks (`prob` or `class`) a stratified split is assumed (equal to `mode="stratified"` in `crossvaldata`).
`kfoldrandom` – similar to `kfold`, except that it always assumes a random sampling (not stratified).
`kfoldorder` – similar to `kfold`, except that instead of a random sampling, the order of the rows is used to build the folds.
vpar – number used by vmethod (optional; if not defined, 2/3 is assumed for `holdout` and 10 for `kfold`).
seed – optional; if not defined, `NA` is assumed. It can be:
`NA` – a random seed is adopted (default R method for generating random numbers);
a vector of size `Runs` – fixed seed numbers, one for each run;
a number – `set.seed(number)` is applied, then a vector of seeds (of size `Runs`) is generated.
`model`	See `fit` for details.
`task`	See `fit` for details.
`search`	See `fit` for details.
`mpar`	Only kept for compatibility with previous `rminer` versions, as you should use `search` instead of `mpar`. See `fit` for details.
`feature`	See `fit` for more details about the `feature="none"`, `"sabs"` or `"sbs"` options. For the `mining` function, additional options are `feature=`fmethod, where fmethod can be one of:
`sens` or `sensg` – compute the 1-D sensitivity analysis input importances (`$sen`), gradient measure.
`sensv` – compute the 1-D sensitivity analysis input importances (`$sen`), variance measure.
`sensr` – compute the 1-D sensitivity analysis input importances (`$sen`), range measure.
`simp`, `simpg` or `s` – equal to `sensg` but also computes the 1-D sensitivity responses (`$sresponses`, useful for `graph="VEC"`).
`simpv` – equal to `sensv` but also computes the 1-D sensitivity responses (useful for `graph="VEC"`).
`simpr` – equal to `sensr` but also computes the 1-D sensitivity responses (useful for `graph="VEC"`).
`scale`	See `fit` for details.
`transform`	See `fit` for details.
`debug`	If TRUE, shows some information about each run.
`...`	See `fit` for details.
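
The index arithmetic behind the `holdoutrol` and `holdoutinc` options of `method` can be illustrated with a minimal base R sketch. The helper `rolling_splits` and its argument names are hypothetical and not part of rminer, and the run-count rule is an assumption (a run is kept while its test block still fits inside the data); the package may compute `Runs` differently. With an assumed dataset of 1000 rows and `c("holdoutrol",20,300,50)`, the sketch yields 14 runs, consistent with the value noted in the Examples.

```r
# Hypothetical helper (NOT an rminer function): builds train/test row indices
# for rolling ("holdoutrol"-like) or incremental ("holdoutinc"-like) holdout.
rolling_splits <- function(n, test_size, window, increment,
                           mode = c("rolling", "incremental")) {
  mode <- match.arg(mode)
  # assumption: a run is valid while its test block still fits in the n rows
  runs <- max(0, floor((n - window - test_size) / increment) + 1)
  lapply(seq_len(runs), function(i) {
    shift <- (i - 1) * increment
    # rolling: the window slides forward; incremental: it grows from row 1
    train_start <- if (mode == "rolling") shift + 1 else 1
    train_end <- window + shift
    list(train = train_start:train_end,
         test = (train_end + 1):(train_end + test_size))
  })
}

# n = 1000 is an assumed dataset size, used only for illustration
splits <- rolling_splits(n = 1000, test_size = 20, window = 300, increment = 50)
inc <- rolling_splits(n = 1000, test_size = 20, window = 300, increment = 50,
                      mode = "incremental")
cat("runs:", length(splits), "\n")
cat("first training rows:", range(splits[[1]]$train), "\n")
cat("first test rows:", range(splits[[1]]$test), "\n")
```

In the rolling mode every training window keeps the same size, while in the incremental mode the training set grows by `increment` rows at each iteration.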

Details

This function trains and tests a particular fit model under several runs and a given validation method (see [Cortez, 2010] for more details).
Several Runs are performed. In each run, the same validation method is adopted (e.g. holdout) and several relevant statistics are stored. Note: this function can require some computational effort, especially if a large dataset and/or a high number of Runs is adopted.
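
The run loop described above can be pictured with a conceptual base R sketch. It is only an illustration under simple assumptions (a linear model via `lm`, a random 2/3 holdout, three runs) and is not how `mining` is implemented; the names `res` and `mae` are hypothetical. As in `mining`, the fitted model is discarded after each run and only the test targets, predictions and elapsed time are kept.

```r
# Conceptual sketch only; mining() is far more general than this loop.
set.seed(123)
d <- data.frame(x1 = rnorm(200, 100, 20), x2 = rnorm(200, 100, 20))
d$y <- 0.7 * sin(d$x1 / (25 * pi)) + 0.3 * sin(d$x2 / (25 * pi))

runs <- 3
res <- list(test = vector("list", runs),
            pred = vector("list", runs),
            time = numeric(runs))
for (r in seq_len(runs)) {
  start <- proc.time()[["elapsed"]]
  # same validation method in every run: random holdout, 2/3 for training
  train_idx <- sample(nrow(d), size = round(2 / 3 * nrow(d)))
  fit <- lm(y ~ x1 + x2, data = d[train_idx, ])  # overwritten in the next run
  res$test[[r]] <- d$y[-train_idx]
  res$pred[[r]] <- unname(predict(fit, newdata = d[-train_idx, ]))
  res$time[r] <- proc.time()[["elapsed"]] - start
}

# one mean absolute error value per run, computed from the stored statistics
mae <- mapply(function(target, pred) mean(abs(target - pred)),
              res$test, res$pred)
print(mae)
```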

Value

A list with the components:

$object – fitted object values of the last run (used by multiple model fitting: "auto" mode). For "holdout", it is equal to a fit object, while for "kfold" it is a list.
$time – vector with time elapsed for each run.
$test – vector list, where each element contains the test (target) results for each run.
$pred – vector list, where each element contains the predicted results for each test set and each run.
$error – vector with a (validation) measure (often an error value) according to search$metric for each run (valid options are explained in mmetric).
$mpar – vector list, where each element contains the fit model mpar parameters (for each run).
$model – the model.
$task – the task.
$method – the external validation method.
$sen – a matrix with the 1-D sensitivity analysis input importances. The number of rows is Runs times vpar, if kfold, else is Runs.
$sresponses – a vector list with a size equal to the number of attributes (useful for graph="VEC"). Each element contains a list with the 1-D sensitivity analysis input responses (n – name of the attribute; l – number of levels; x – attribute values; y – 1-D sensitivity responses).
Important note: sresponses (and "VEC" graphs) are only available if feature="sabs" or a "simp" related option is used (see feature).
$runs – the Runs.
$attributes – vector list with all attributes (features) selected in each run (and fold if kfold) if a feature selection algorithm is used.
$feature – the feature.

Note

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

To check for more details about rminer and for citation purposes:
P. Cortez.
Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool.
In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1.
@Springer: https://link.springer.com/chapter/10.1007/978-3-642-14400-4_44
http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf
This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
http://hdl.handle.net/1822/36210
For the grid search and other optimization methods:
P. Cortez.
Modern Optimization with R.
Use R! series, Springer, September 2014, ISBN 978-3-319-08262-2.
https://www.springer.com/gp/book/9783319082622

Examples

### dontrun is used when the execution of the example requires some computational effort.

### simple regression example
set.seed(123); x1=rnorm(200,100,20); x2=rnorm(200,100,20)
y=0.7*sin(x1/(25*pi))+0.3*sin(x2/(25*pi))
# mining with an ensemble of neural networks, each fixed with size=2 hidden nodes
# assumes a default holdout (random split) with 2/3 for training and 1/3 for testing:
M=mining(y~x1+x2,Runs=2,model="mlpe",search=2)
print(M)
print(mmetric(M,metric="MAE"))

### more regression examples:
## Not run: 
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
# 5 runs of an external holdout with 2/3 for training and 1/3 for testing, fixed seed 12345
# feature selection: sabs method
# model selection: 5 searches for size, internal 2-fold cross validation fixed seed 123
#                  with optimization for minimum MAE metric 
M=mining(y~.,data=sin1reg,Runs=5,method=c("holdout",2/3,12345),model="mlpe",
         search=list(search=mparheuristic("mlpe",n=5),method=c("kfold",2,123),metric="MAE"),
         feature="sabs")
print(mmetric(M,metric="MAE"))
print(M$mpar)
print("median hidden nodes (size) and number of MLPs (nr):")
print(centralpar(M$mpar))
print("attributes used by the model in each run:")
print(M$attributes)
mgraph(M,graph="RSC",Grid=10,main="sin1 MLPE scatter plot")
mgraph(M,graph="REP",Grid=10,main="sin1 MLPE REP",sort=FALSE)
mgraph(M,graph="REC",Grid=10,main="sin1 MLPE REC")
mgraph(M,graph="IMP",Grid=10,main="input importances",xval=0.1,leg=names(sin1reg))
# average influence of x1 on the model:
mgraph(M,graph="VEC",Grid=10,main="x1 VEC curve",xval=1,leg=names(sin1reg)[1])

## End(Not run)

### regression example with holdout rolling windows:
## Not run: 
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
# rolling with 20 test samples, training window size of 300 and increment of 50 in each run:
# note that Runs argument is automatically set to 14 in this example:
M=mining(y~.,data=sin1reg,method=c("holdoutrol",20,300,50),
         model="mlpe",debug=TRUE)

## End(Not run)

### regression example with all rminer models:
## Not run: 
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
models=c("naive","ctree","rpart","kknn","mlp","mlpe","ksvm","randomForest","mr","mars",
         "cubist","pcr","plsr","cppls","rvm")
for(model in models)
{ 
 M=mining(y~.,data=sin1reg,method=c("holdout",2/3,12345),model=model)
 cat("model:",model,"MAE:",round(mmetric(M,metric="MAE")$MAE,digits=3),"\n")
}

## End(Not run)

### classification example (task="prob")
## Not run: 
data(iris)
# 10 runs of a 3-fold cross validation with fixed seed 123 for generating the 3-fold runs
M=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="rpart")
print(mmetric(M,metric="CONF"))
print(mmetric(M,metric="AUC"))
print(meanint(mmetric(M,metric="AUC")))
mgraph(M,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg="Versicolor",
       main="Versicolor ROC")
mgraph(M,graph="LIFT",TC=2,baseline=TRUE,Grid=10,leg="Versicolor",
       main="Versicolor LIFT")
M2=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="ksvm")
L=vector("list",2)
L[[1]]=M;L[[2]]=M2
mgraph(L,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg=c("DT","SVM"),main="ROC")

## End(Not run)

### other classification examples
## Not run: 
### 1st example:
data(iris)
# 2 runs of an external 2-fold validation, random seed
# model selection: SVM model with rbfdot kernel, automatic search for sigma,
#                  internal 3-fold validation, random seed, "AUC" metric is assumed
# feature selection: none, "s" is used only to store input importance values
M=mining(Species~.,data=iris,Runs=2,method=c("kfold",2,NA),model="ksvm",
         search=list(search=mparheuristic("ksvm"),method=c("kfold",3)),feature="s")

print(mmetric(M,metric="AUC",TC=2))
mgraph(M,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg="SVM",main="ROC",intbar=FALSE)
mgraph(M,graph="IMP",TC=2,Grid=10,main="input importances",xval=0.1,
       leg=names(iris),axis=1)
mgraph(M,graph="VEC",TC=2,Grid=10,main="Petal.Width VEC curve",
       data=iris,xval=4)
### 2nd example, ordered kfold, k-nearest neighbor:
M=mining(Species~.,iris,Runs=1,method=c("kfoldo",3),model="knn")
# confusion matrix:
print(mmetric(M,metric="CONF"))

### 3rd example, use of all rminer models: 
models=c("naive","ctree","rpart","kknn","mlp","mlpe","ksvm","randomForest","bagging",
         "boosting","lda","multinom","naiveBayes","qda")
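# note: the next line overrides the vector above, so only "naiveBayes" is run;
# remove it to loop over all listed models:
models="naiveBayes"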
for(model in models)
{ 
 M=mining(Species~.,iris,Runs=1,method=c("kfold",3,123),model=model)
 cat("model:",model,"ACC:",round(mmetric(M,metric="ACC")$ACC,digits=1),"\n")
}

## End(Not run)

### multiple models: automl or ensembles 
## Not run: 

data(iris)
d=iris
names(d)[ncol(d)]="y" # change output name
inputs=ncol(d)-1
metric="AUC"

# simple automl (1 search per individual model),
# internal holdout and external holdout:
sm=mparheuristic(model="automl",n=NA,task="prob",inputs=inputs)
mode="auto"

imethod=c("holdout",4/5,123) # internal validation method
emethod=c("holdout",2/3,567) # external validation method

search=list(search=sm,smethod=mode,method=imethod,metric=metric,convex=0)
M=mining(y~.,data=d,model="auto",search=search,method=emethod,fdebug=TRUE)
# 1 single model was selected:
cat("best",emethod[1],"selected model:",M$object@model,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")

# simple automl (1 search per individual model),
# internal kfold and external kfold: 
imethod=c("kfold",3,123) # internal validation method
emethod=c("kfold",5,567) # external validation method
search=list(search=sm,smethod=mode,method=imethod,metric=metric,convex=0)
M=mining(y~.,data=d,model="auto",search=search,method=emethod,fdebug=TRUE)
# kfold models were selected:
kfolds=as.numeric(emethod[2])
models=vector(length=kfolds)
for(i in 1:kfolds) models[i]=M$object$model[[i]]
cat("best",emethod[1],"selected models:",models,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")

# example with weighted ensemble:
M=mining(y~.,data=d,model="WE",search=search,method=emethod,fdebug=TRUE)
for(i in 1:kfolds) models[i]=M$object$model[[i]]
cat("best",emethod[1],"selected models:",models,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")


## End(Not run)


### for more fitting examples check the help of function fit: help(fit,package="rminer")

[Package rminer version 1.4.6 Index]