randomMachines {randomMachines}    R Documentation

Random Machines

Description

Random Machines is an ensemble model that combines different kernel functions to increase diversity within the bagging approach, which generally improves predictions. It was developed for classification and regression problems by bagging support vector models built with multiple kernel functions.

Random Machines uses SVMs (Cortes and Vapnik, 1995) as base learners in the bagging procedure with a random sample of kernel functions to build them.

Let the training sample be given by (\boldsymbol{x_{i}},y_i), i=1,\dots,n, where \boldsymbol{x_{i}} is the vector of independent variables and y_{i} the dependent variable. The kernel bagging method starts by training R single learners, one for each kernel function r=1,\dots,R, where R is the total number of different kernel functions that could be used in support vector models. In this implementation the default value is R=4 (Gaussian, polynomial, Laplacian and linear). See more details below.

Each single learner is internally validated, and the weights \lambda_{r} are calculated in proportion to the predictive performance of each single learner.

Afterwards, B bootstrap samples are drawn from the training set. A support vector machine model g_{b} is trained on each bootstrap sample, b=1,\dots,B, and the kernel function used for g_{b} is chosen at random with probability \lambda_{r}. The final weight w_b in the bagging procedure is calculated from the out-of-bag samples.
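The sampling step described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's internal code: the values in lambda are made-up placeholders for the validated kernel probabilities, and n and all object names are hypothetical.

```r
# Minimal base-R sketch of the kernel-sampling and bootstrap step.
# The lambda values are illustrative placeholders, not package output.
set.seed(42)

kernels <- c("rbfdot", "polydot", "laplacedot", "vanilladot")
lambda  <- c(0.40, 0.30, 0.20, 0.10)  # assumed probabilities, sum to 1

n <- 50  # hypothetical number of training observations
B <- 25  # number of bootstrap samples

# Draw one kernel per bootstrap model g_b with probability lambda_r
kernel_b <- sample(kernels, size = B, replace = TRUE, prob = lambda)

# Draw B bootstrap index sets; the out-of-bag rows are those never drawn
boot_idx <- lapply(seq_len(B), function(b) sample.int(n, n, replace = TRUE))
oob_idx  <- lapply(boot_idx, function(idx) setdiff(seq_len(n), idx))

# Each g_b would be fitted on boot_idx[[b]] with kernel_b[b],
# and its weight w_b computed from predictions on oob_idx[[b]].
table(kernel_b)
```

Kernels with a larger \lambda_{r} are drawn more often, so better-validated kernels tend to dominate the ensemble while weaker ones still contribute diversity.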

The final model G(\boldsymbol{x}_i) for a new \boldsymbol{x}_i combines the predictions of the B models g_{b}, each weighted by w_b.

The weights \lambda_{r} and w_b are calculated differently for each task (classification, probabilistic classification and regression). See the references for more details.
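As a rough sketch of the aggregation step, the combination can be written as a weighted vote or a weighted average. The forms below are an assumption based on the usual construction for bagged, weighted SVMs: they take g_{b}(\boldsymbol{x}_i) \in \{-1, 1\} in binary classification and normalised weights in regression. The exact expressions, and the definitions of w_b, are given in the references.

```latex
% Assumed form for binary classification: weighted majority vote
G(\boldsymbol{x}_i) = \operatorname{sgn}\left( \sum_{b=1}^{B} w_b \, g_b(\boldsymbol{x}_i) \right)

% Assumed form for regression: weighted average with normalised weights
G(\boldsymbol{x}_i) = \sum_{b=1}^{B} w_b \, g_b(\boldsymbol{x}_i),
\qquad \sum_{b=1}^{B} w_b = 1
```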

Usage

randomMachines(
     formula,
     train,
     validation,
     B = 25,
     cost = 1,
     automatic_tuning = FALSE,
     gamma_rbf = 1,
     gamma_lap = 1,
     degree = 2,
     poly_scale = 1,
     offset = 0,
     gamma_cau = 1,
     d_t = 2,
     kernels = c("rbfdot", "polydot", "laplacedot", "vanilladot"),
     prob_model = TRUE,
     loss_function = RMSE,
     epsilon = 0.1,
     beta = 2
)

Arguments

formula

an object of class formula: it should contain a symbolic description of the model to be fitted, indicating the dependent variable and all predictors that should be included.

train

the training data \left\{\left( \mathbf{x}_{i},y_{i} \right)\right\}_{i=1}^{n} used to train the model.

validation

the validation data \left\{\left( \mathbf{x}_{i},y_{i}\right) \right\}_{i=1}^{V} used to calculate the probabilities \lambda_{r}. If validation = NULL, a 0.25 partition of the training data is used as the validation set and the remaining partition becomes the new training sample.

B

number of bootstrap samples. The default value is B=25.

cost

the constant C of the regularization term for the soft margin in support vector models. The default value is cost=1.

automatic_tuning

a boolean defining whether the kernel hyperparameters are selected automatically using sigest, as in the ksvm function. The default value is FALSE.

gamma_rbf

the hyperparameter \gamma_{g} used in the RBF kernel. The default value is gamma_rbf=1.

gamma_lap

the hyperparameter \gamma_{l} used in the Laplacian kernel. The default value is gamma_lap=1.

degree

the degree used in the Polynomial kernel. The default value is degree=2.

poly_scale

the scale parameter from the Polynomial kernel. The default value is poly_scale=1.

offset

the offset parameter from the Polynomial kernel. The default value is offset=0.

gamma_cau

the hyperparameter \gamma_{c} used in the Cauchy kernel. The default value is gamma_cau=1.

d_t

the d_{t}-norm from the t-Student kernel. The default value is d_t=2.

kernels

a vector with the names of the kernel functions to be used in the Random Machines model. The default includes the kernel functions c("rbfdot", "polydot", "laplacedot", "vanilladot"). The additional kernel functions "cauchydot" and "tdot" are exclusive to the binary classification setting.

prob_model

a boolean defining whether the algorithm uses a probabilistic approach to define the predictions (default = TRUE).

loss_function

the loss function used in the regression approach. The default is the RMSE function, but other functions can be used.

epsilon

the epsilon in the loss function of the SVR implementation. The default value is epsilon=0.1.

beta

the correlation parameter \beta, which calibrates the penalisation of each kernel's performance in regression tasks. The default value is beta=2.
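The behaviour described for validation = NULL can be illustrated with a small base-R sketch. It assumes a simple random 25% hold-out; the package may implement the split differently, and the toy data frame dat and the other object names are hypothetical.

```r
# Sketch of a 25% hold-out split, as described for validation = NULL.
# Assumption: a simple random split; the package may do this differently.
set.seed(1)

dat <- data.frame(x = rnorm(40), y = rnorm(40))  # toy data for illustration

n_val   <- floor(0.25 * nrow(dat))       # size of the validation partition
val_idx <- sample.int(nrow(dat), n_val)  # rows held out for validation

validation <- dat[val_idx, ]   # used to estimate the probabilities lambda_r
train      <- dat[-val_idx, ]  # remaining rows become the new training sample

c(train = nrow(train), validation = nrow(validation))
```

Passing an explicit validation set avoids this reduction of the training sample, which may matter for small data sets.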

Details

Random Machines is an ensemble method that combines the bagging procedure proposed by Breiman (1996), using Support Vector Machine models as base learners, with a random selection of kernel functions that adds diversity to the ensemble without harming its predictive performance. The kernel functions k(x,y) are chosen through the kernels argument and parameterised by gamma_rbf, gamma_lap, degree, poly_scale, offset, gamma_cau and d_t.
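For orientation, the four default kernels can be written in the standard form used by the kernlab kernel generators of the same names. Mapping \gamma_{g} to gamma_rbf, \gamma_{l} to gamma_lap, s to poly_scale, c to offset and d to degree is an assumption based on the argument descriptions; the package may parameterise or scale the kernels differently. The Cauchy and t-Student kernels are left out because their exact form is not stated here.

```latex
% Standard kernlab-style forms of the four default kernels (assumed mapping)
\begin{aligned}
\text{Gaussian (rbfdot):}      \quad & k(x,y) = \exp\left(-\gamma_{g}\,\lVert x - y \rVert^{2}\right) \\
\text{Polynomial (polydot):}   \quad & k(x,y) = \left(s\,\langle x, y \rangle + c\right)^{d} \\
\text{Laplacian (laplacedot):} \quad & k(x,y) = \exp\left(-\gamma_{l}\,\lVert x - y \rVert\right) \\
\text{Linear (vanilladot):}    \quad & k(x,y) = \langle x, y \rangle
\end{aligned}
```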

Value

randomMachines() returns an object of class "rm_class" for classification tasks, or "rm_reg" if the target variable is a continuous numerical response. See predict.rm_class or predict.rm_reg for details on how to obtain predictions from each model.

Author(s)

Mateus Maia: mateusmaia11@gmail.com, Gabriel Felipe Ribeiro: brielribeiro08@gmail.com, Anderson Ara: ara@ufpr.br

References

Ara, A., et al. (2022). Regression random machines: An ensemble support vector regression model with free kernel choice. Expert Systems with Applications, 202, 117107.

Ara, A., et al. (2021). Random machines: A bagged-weighted support vector model with free kernel choice. Journal of Data Science, 19(3), 409-428.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.

Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.

Maia, M., Azevedo, A. R., and Ara, A. (2021). Predictive comparison between random machines and random forests. Journal of Data Science, 19(4), 593-614.

Examples

library(randomMachines)

# Simulate training data from a binary classification setting
sim_data <- sim_class(n = 75)

# Simulate new data to predict on
sim_new <- sim_class(n = 75)

# Fit Random Machines (probabilistic output)
rm_mod_prob <- randomMachines(y ~ ., train = sim_data)

# Fit Random Machines (binary class output)
rm_mod_label <- randomMachines(y ~ ., train = sim_data, prob_model = FALSE)

# Predict for the new data
y_hat <- predict(rm_mod_label, sim_new)

[Package randomMachines version 0.1.0 Index]