jous {JOUSBoost}    R Documentation

Jittering with Over/Under Sampling

Description

Perform probability estimation using jittering with over- or undersampling.

Usage

jous(X, y, class_func, pred_func, type = c("under", "over"), delta = 10,
  nu = 1, X_pred = NULL, keep_models = FALSE, verbose = FALSE,
  parallel = FALSE, packages = NULL)

Arguments

X

A matrix of continuous predictors.

y

A vector of responses with entries in c(-1, 1).

class_func

Function to perform classification. This function definition must be exactly of the form class_func(X, y) where X is a matrix and y is a vector with entries in c(-1, 1), and it must return an object on which pred_func can create predictions. See examples.

pred_func

Function to create predictions. This function definition must be exactly of the form pred_func(fit_obj, X) where fit_obj is an object returned by class_func and X is a matrix of new data values, and it must return a vector with entries in c(-1, 1). See examples.
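A minimal sketch of a pair of functions that satisfy the class_func(X, y) and pred_func(fit_obj, X) contract described above, using logistic regression from base R in place of a boosting classifier. The names to_df, glm_class_func, and glm_pred_func, as well as the simulated data, are illustrative assumptions and not part of JOUSBoost.

```r
# Hypothetical helper: convert a predictor matrix to a data frame with
# fixed column names so that fitting and prediction agree on names.
to_df <- function(X) {
  df <- as.data.frame(X)
  names(df) <- paste0("x", seq_len(ncol(X)))
  df
}

# Hypothetical classifier wrapper: X is a matrix, y has entries in c(-1, 1).
glm_class_func <- function(X, y) {
  df <- to_df(X)
  df$y01 <- as.integer(y == 1)  # recode labels to 0/1 for glm
  glm(y01 ~ ., data = df, family = binomial())
}

# Hypothetical prediction wrapper: returns a vector with entries in c(-1, 1).
glm_pred_func <- function(fit_obj, X) {
  p <- predict(fit_obj, newdata = to_df(X), type = "response")
  unname(ifelse(p > 0.5, 1, -1))
}

# Simulated data, for illustration only
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
y <- ifelse(X[, 1] + rnorm(100) > 0, 1, -1)

fit <- glm_class_func(X, y)
yhat <- glm_pred_func(fit, X)
```

Functions of this shape could then be passed as the class_func and pred_func arguments, provided they return class labels rather than probabilities.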

type

Type of sampling: "over" for oversampling, or "under" for undersampling.

delta

An integer (greater than 3) that controls the number of quantiles to estimate.

nu

The amount of noise to apply to predictors when oversampling data. The noise level for each predictor j is controlled by nu * sd(X[,j]); the default of nu = 1 works well. Such "jittering" of the predictors is essential when applying jous to boosting-type methods.
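The scaling of the noise by nu * sd(X[,j]) can be illustrated with a short sketch. It assumes uniform noise on (-nu * sd, nu * sd) for each column; the noise distribution that jous uses internally may differ, and jitter_columns is a hypothetical helper, not a JOUSBoost function.

```r
# Hypothetical helper: add noise scaled by nu * sd(X[, j]) to each column.
# Assumption: uniform noise on (-nu * sd, nu * sd); the package may differ.
jitter_columns <- function(X, nu = 1) {
  sds <- apply(X, 2, sd)  # per-predictor standard deviation
  # One column of noise per predictor, each with nrow(X) draws
  noise <- sapply(sds, function(s) runif(nrow(X), -nu * s, nu * s))
  X + noise
}

set.seed(42)
X <- matrix(rnorm(300), ncol = 3)
X_jit <- jitter_columns(X, nu = 1)
```

Because the noise is scaled per column, predictors measured on different scales are perturbed by comparable relative amounts.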

X_pred

A matrix of predictors for which to form probability estimates.

keep_models

Whether to store all of the models used to create the probability estimates. If keep_models = FALSE, the user will need to re-run jous when creating probability estimates for test data.

verbose

If TRUE, print the function's progress to the terminal.

parallel

If TRUE, use parallel foreach to fit models. A parallel backend, such as doParallel, must be registered beforehand. See examples below.

packages

If parallel = TRUE, a vector of strings containing the names of any packages used in class_func or pred_func. See examples below.

Value

Returns a list containing information about the parameters used in the jous function call, as well as the following additional components:

q

The vector of target quantiles estimated by jous. Note that the estimated probabilities will be located at the midpoints of the values in q.

phat_train

The in-sample probability estimates p(y=1|x).

phat_test

Probability estimates for the optional test data in X_pred.

models

If keep_models=TRUE, a list of models fitted to the resampled data sets.

confusion_matrix

A confusion matrix for the in-sample fits.
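For orientation, a confusion matrix for in-sample fits can be sketched with base R. This sketch assumes class labels are obtained by thresholding the probability estimates at 0.5; the layout and thresholding used for the confusion_matrix component may differ. The vectors y and phat are made-up values for illustration.

```r
# Made-up labels and probability estimates, for illustration only
y    <- c(1, -1, 1, 1, -1, -1)
phat <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1)

# Assumption: classify as 1 when the estimated probability exceeds 0.5
yhat <- ifelse(phat > 0.5, 1, -1)

# Cross-tabulate true labels against predicted labels
cm <- table(truth = factor(y, levels = c(-1, 1)),
            predicted = factor(yhat, levels = c(-1, 1)))

# Fraction of in-sample observations classified correctly
accuracy <- sum(diag(cm)) / sum(cm)
```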

Note

The jous function runs the classifier class_func a total of delta times on the data, which can be computationally expensive. Also, jous cannot yet be applied to categorical predictors: in the oversampling case, it is not clear how to "jitter" a categorical variable.

References

Mease, D., Wyner, A. and Buja, A. (2007). Cost-weighted boosting with jittering and over/under-sampling: JOUS-boost. J. Machine Learning Research 8 409-439.

Examples

## Not run: 
# Generate data from Friedman model #
set.seed(111)
dat = friedman_data(n = 500, gamma = 0.5)
train_index = sample(1:500, 400)

# Apply jous to adaboost classifier
class_func = function(X, y) adaboost(X, y, tree_depth = 2, n_rounds = 200)
pred_func = function(fit_obj, X_test) predict(fit_obj, X_test)

jous_fit = jous(dat$X[train_index,], dat$y[train_index], class_func,
                pred_func, keep_models = TRUE)
# Get probability estimates on the held-out data
phat_jous = predict(jous_fit, dat$X[-train_index, ], type = "prob")

# compare with probability from AdaBoost
ada = adaboost(dat$X[train_index,], dat$y[train_index], tree_depth = 2,
               n_rounds = 200)
phat_ada = predict(ada, dat$X[-train_index,], type = "prob")

mean((phat_jous - dat$p[-train_index])^2)
mean((phat_ada - dat$p[-train_index])^2)

## Example using parallel option

library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)

# n.b. packages = 'rpart' is not really needed here, since rpart is
# exported automatically by JOUSBoost; it is included for illustration
jous_fit = jous(dat$X[train_index,], dat$y[train_index], class_func,
                pred_func, keep_models = TRUE, parallel = TRUE,
                packages = 'rpart')
phat = predict(jous_fit, dat$X[-train_index,], type = 'prob')
stopCluster(cl)

## Example using SVM

library(kernlab)
class_func = function(X, y) ksvm(X, as.factor(y), kernel = 'rbfdot')
pred_func = function(obj, X) as.numeric(as.character(predict(obj, X)))
jous_obj = jous(dat$X[train_index,], dat$y[train_index], class_func = class_func,
           pred_func = pred_func, keep_models = TRUE)
jous_pred = predict(jous_obj, dat$X[-train_index,], type = 'prob')

## End(Not run)

[Package JOUSBoost version 2.1.0 Index]