SmoteClassif {UBL} | R Documentation |
SMOTE algorithm for unbalanced classification problems
Description
This function handles unbalanced classification problems using the SMOTE method. Namely, it can generate a new "SMOTEd" data set that addresses the class unbalance problem.
Usage
SmoteClassif(form, dat, C.perc = "balance", k = 5, repl = FALSE,
dist = "Euclidean", p = 2)
Arguments
form |
A formula describing the prediction problem |
dat |
A data frame containing the original (unbalanced) data set |
C.perc |
A named list containing the percentage(s) of under- or/and over-sampling to apply to each class. The over-sampling percentage is a number above 1 while the under-sampling percentage should be a number below 1. If the number 1 is provided for a given class then that class remains unchanged. Alternatively it may be "balance" (the default) or "extreme", cases where the sampling percentages are automatically estimated either to balance the examples between the minority and majority classes or to invert the distribution of examples across the existing classes transforming the majority classes into minority and vice-versa. |
k |
A number indicating the number of nearest neighbors that are used to generate the new examples of the minority class(es). |
repl |
A boolean value controlling the possibility of having repetition of examples when performing under-sampling by selecting among the majority class(es) examples. |
dist |
A character string indicating which distance metric to use when determining the k nearest neighbors. See the details. Defaults to "Euclidean". |
p |
A number indicating the value of p if the "p-norm" distance is chosen. Only necessary to define if a "p-norm" is chosen in the |
Details
dist
parameter:The parameter
dist
allows the user to define the distance metric to be used in the neighbors computation. Although the default is the Euclidean distance, other metrics are available. This allows the computation of distances in data sets with, for instance, both nominal and numeric features. The options available for the distance functions are as follows:- for data with only numeric features: "Manhattan", "Euclidean", "Canberra", "Chebyshev", "p-norm";
- for data with only nominal features: "Overlap";
- for dealing with both nominal and numeric features: "HEOM", "HVDM".
When the "p-norm" is selected for the
dist
parameter, it is also necessary to define the value of parameterp
. The value of parameterp
sets which "p-norm" will be used. For instance, ifp
is set to 1, the "1-norm" (or Manhattan distance) is used, and ifp
is set to 2, the "2-norm" (or Euclidean distance) is applied. For more details regarding the distance functions implemented in UBL package please see the package vignettes.- Smote algorithm:
Unbalanced classification problems cause problems to many learning algorithms. These problems are characterized by the uneven proportion of cases that are available for each class of the problem.
SMOTE (Chawla et. al. 2002) is a well-known algorithm to fight this problem. The general idea of this method is to artificially generate new examples of the minority class using the nearest neighbors of these cases. Furthermore, the majority class examples are also under-sampled, leading to a more balanced dataset.
The parameter
C.perc
controls the amount of over-sampling and under-sampling applied and can be automatically estimated either to balance or invert the distribution of examples across the different classes. The parameterk
controls the number of neighbors used to generate new synthetic examples.
Value
The function returns a data frame with the new data set resulting from the application of the SMOTE algorithm.
Author(s)
Paula Branco paobranco@gmail.com, Rita Ribeiro rpribeiro@dcc.fc.up.pt and Luis Torgo ltorgo@dcc.fc.up.pt
References
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357.
See Also
RandUnderClassif, RandOverClassif
Examples
## A small example with a data set created artificially from the IRIS
## data
data(iris)
dat <- iris[, c(1, 2, 5)]
dat$Species <- factor(ifelse(dat$Species == "setosa", "rare", "common"))
## checking the class distribution of this artificial data set
table(dat$Species)
## now using SMOTE to create a more "balanced problem"
newData <- SmoteClassif(Species ~ ., dat, C.perc = list(common = 1,rare = 6))
table(newData$Species)
## Checking visually the created data
par(mfrow = c(1, 2))
plot(dat[, 1], dat[, 2], pch = 19 + as.integer(dat[, 3]),
main = "Original Data")
plot(newData[, 1], newData[, 2], pch = 19 + as.integer(newData[, 3]),
main = "SMOTE'd Data")
# automatically balancing the data maintaining the total number of examples
datBal <- SmoteClassif(Species ~ ., dat, C.perc = "balance")
table(datBal$Species)
# automatically inverting the original distribution of examples
datExt <- SmoteClassif(Species ~ ., dat, C.perc = "extreme")
table(datExt$Species)
if (requireNamespace("DMwR2", quietly = TRUE)) {
data(algae, package ="DMwR2")
clean.algae <- data.frame(algae[complete.cases(algae), ])
C.perc = list(autumn = 2, summer = 1.5, winter = 0.9)
# class spring remains unchanged
# In this case it is necessary to define a distance function that
# is able to deal with both nominal and numeric features
mysmote.algae <- SmoteClassif(season~., clean.algae, C.perc, dist = "HEOM")
# the distance function may be HVDM
smoteBalan.algae <- SmoteClassif(season~., clean.algae, "balance",
dist = "HVDM")
smoteExtre.algae <- SmoteClassif(season~., clean.algae, "extreme",
dist = "HVDM")
} else {
library(MASS)
data(cats)
mysmote.cats <- SmoteClassif(Sex~., cats, list(M = 0.8, F = 1.8))
smoteBalan.cats <- SmoteClassif(Sex~., cats, "balance")
smoteExtre.cats <- SmoteClassif(Sex~., cats, "extreme")
}