SCUT {scutr} | R Documentation |
SMOTE and cluster-based undersampling technique.
Description
This function balances multiclass training datasets. In a dataframe with n classes and m rows, the resulting dataframe will have m / n rows per class. SCUT_parallel() distributes each over/undersampling task across multiple cores. A speedup usually occurs only when many classes are resampled with one of the slower techniques (e.g. undersample_mclust()). Note that SCUT_parallel() will always run on one core on Windows.
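To make the m / n target concrete, the following base-R sketch computes it for the chickwts dataset used in the examples. It does not call scutr. How SCUT() rounds a non-integer m / n is not stated on this page, so the target is left unrounded here, and the split into classes to oversample and undersample illustrates the idea rather than the package's exact logic.

```r
# Per-class target size for a dataset with m rows and n classes.
counts <- table(chickwts$feed)   # current number of rows in each class
m <- nrow(chickwts)              # total rows
n <- length(counts)              # number of classes
target <- m / n                  # rows per class after balancing (unrounded)

# Classes below the target would need oversampling, those above undersampling.
to_oversample  <- names(counts)[counts < target]
to_undersample <- names(counts)[counts > target]

print(counts)
print(target)
print(to_oversample)
print(to_undersample)
```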
Usage
SCUT(
  data,
  cls_col,
  oversample = oversample_smote,
  undersample = undersample_mclust,
  osamp_opts = list(),
  usamp_opts = list()
)

SCUT_parallel(
  data,
  cls_col,
  ncores = detectCores() %/% 2,
  oversample = oversample_smote,
  undersample = undersample_mclust,
  osamp_opts = list(),
  usamp_opts = list()
)
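The default for ncores is half of the detected cores. A caller who wants to pick the value explicitly might do something like the sketch below, which uses only the parallel package that ships with R. The fallback to one core and the variable name safe_ncores are illustrative assumptions, not part of scutr.

```r
library(parallel)

# Mirror the documented default: half of the detected cores.
safe_ncores <- detectCores() %/% 2

# detectCores() may return NA, and a single-core machine yields 0; fall back to 1.
if (is.na(safe_ncores) || safe_ncores < 1) safe_ncores <- 1L

# SCUT_parallel() runs on one core on Windows anyway, so request one there.
if (.Platform$OS.type == "windows") safe_ncores <- 1L

print(safe_ncores)
```

The resulting value could then be passed as ncores = safe_ncores.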
Arguments
data | Numeric data frame. |
cls_col | The column in data that holds the class labels. |
oversample | Oversampling method. Must be a function with the signature |
undersample | Undersampling method. Must be a function with the signature |
osamp_opts | List of options passed to the oversampling function. |
usamp_opts | List of options passed to the undersampling function. |
ncores | Number of cores to use with SCUT_parallel(). |
Details
Custom functions can be used to perform under/oversampling (see the required signature in the Arguments section). Parameters represented by ... should be passed via osamp_opts or usamp_opts as a list.
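As a rough illustration of a custom resampler, the sketch below defines a hypothetical undersample_seeded() that keeps a random subset of one class. The argument list (data, cls, cls_col, m, ...) and the convention of returning only the rows of the requested class are assumptions, since the required signature is not reproduced on this page; check them against the package before relying on this. The seed option is likewise invented for the example.

```r
# Hypothetical custom undersampler. The argument list (data, cls, cls_col, m, ...)
# is an assumption; verify it against the signature the package requires.
undersample_seeded <- function(data, cls, cls_col, m, ...) {
  opts <- list(...)                                     # extra options, e.g. seed
  rows <- data[data[[cls_col]] == cls, , drop = FALSE]  # rows of the requested class
  if (!is.null(opts$seed)) set.seed(opts$seed)          # optional reproducibility
  keep <- sample.int(nrow(rows), min(m, nrow(rows)))    # indices, without replacement
  rows[keep, , drop = FALSE]
}

# Standalone check on iris: reduce the setosa class to 10 rows.
small <- undersample_seeded(iris, "setosa", "Species", m = 10, seed = 1)
print(nrow(small))

# With scutr loaded, the extra option would be supplied as a list (not run here):
# SCUT(iris, "Species", undersample = undersample_seeded,
#      usamp_opts = list(seed = 1))
```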
Value
A dataframe with equal class distribution.
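One way to confirm that a result has an equal class distribution is a small base-R check such as the one below. The helper name is_balanced() is hypothetical and not part of scutr; it is demonstrated on built-in datasets so that it runs without the package.

```r
# TRUE when every class present in `cls_col` has the same number of rows.
is_balanced <- function(df, cls_col) {
  counts <- table(as.character(df[[cls_col]]))  # as.character() ignores unused levels
  length(unique(as.vector(counts))) == 1
}

print(is_balanced(iris, "Species"))    # iris has 50 rows per species
print(is_balanced(chickwts, "feed"))   # chickwts feed groups differ in size
```

Applied to the output of SCUT(), the same call would take the returned dataframe and the name of its class column.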
References
Agrawal A, Viktor HL, Paquet E (2015). 'SCUT: Multi-class imbalanced data classification using SMOTE and cluster-based undersampling.' In 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), volume 01, 226-234.
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002). 'SMOTE: Synthetic Minority Over-sampling Technique.' Journal of Artificial Intelligence Research, 16, 321-357. ISSN 1076-9757, doi:10.1613/jair.953, https://www.jair.org/index.php/jair/article/view/10302.
Examples
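# Balance iris, undersampling via hierarchical clustering with the Manhattan distance.
ret <- SCUT(iris, "Species", undersample = undersample_hclust,
            usamp_opts = list(dist_calc = "manhattan"))
# Balance chickwts, undersampling via k-means.
ret2 <- SCUT(chickwts, "feed", undersample = undersample_kmeans)
# Inspect the resulting class counts.
table(ret$Species)
table(ret2$feed)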
# SCUT_parallel fires a warning if ncores > 1 on Windows and will run on
# one core only.
ret <- SCUT_parallel(wine, "type", ncores = 1, undersample = undersample_kmeans)
table(ret$type)