smote {PDtoolkit} | R Documentation |
Synthetic Minority Oversampling Technique (SMOTE)
Description
smote performs a type of data augmentation for the selected (usually minority) class. In order to process continuous and
categorical risk factors simultaneously, the Heterogeneous Euclidean-Overlap Metric (HEOM) is used in the nearest neighbors
algorithm.
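To illustrate how HEOM combines numeric and categorical risk factors, the following is a minimal sketch of the textbook definition: a range-normalized absolute difference for numeric variables, a 0/1 overlap for categorical ones, and the maximal distance of 1 when a value is missing. The helper `heom_dist` and the toy data are hypothetical and are not part of PDtoolkit; the package's internal implementation (for example its treatment of ordinal risk factors) may differ.

```r
# Hypothetical helper (not part of PDtoolkit): HEOM distance between two
# one-row data frames x and y that share the same columns.
# `ranges` is a named numeric vector with the range (max - min) of each
# numeric column, computed on the full data set.
heom_dist <- function(x, y, ranges) {
  d <- vapply(names(x), function(v) {
    xv <- x[[v]]
    yv <- y[[v]]
    # missing value on either side: maximal per-variable distance
    if (is.na(xv) || is.na(yv)) return(1)
    if (is.numeric(xv)) {
      r <- ranges[[v]]
      if (r == 0) return(0)          # constant column contributes nothing
      return(abs(xv - yv) / r)       # range-normalized difference
    }
    # categorical variable: overlap metric (0 if equal, 1 otherwise)
    as.numeric(as.character(xv) != as.character(yv))
  }, numeric(1))
  sqrt(sum(d^2))
}

# toy data, assumed for illustration only
toy <- data.frame(age = c(25, 45, 65),
                  purpose = c("car", "car", "education"),
                  stringsAsFactors = FALSE)
ranges <- vapply(toy[sapply(toy, is.numeric)],
                 function(z) diff(range(z, na.rm = TRUE)),
                 numeric(1))

d12 <- heom_dist(toy[1, ], toy[2, ], ranges)  # same purpose, age differs by half the range
d13 <- heom_dist(toy[1, ], toy[3, ], ranges)  # different purpose, age differs by the full range
```

Because every per-variable distance lies in [0, 1], numeric and categorical risk factors contribute on a comparable scale when the nearest neighbors are ranked.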
Usage
smote(
db,
target,
minority.class,
osr,
ordinal.rf = NULL,
num.rf.const = NULL,
k = 5,
seed = 81000
)
Arguments
db |
Data set of risk factors and target variable. |
target |
Name of target variable within db. |
minority.class |
Value of the minority class. It can be a numeric or character value, but it has to exist in the target variable. |
osr |
Oversampling rate. It has to be a numeric value greater than 0 (for example, 0.2 for 20% oversampling). |
ordinal.rf |
Character vector of ordinal risk factors. Default value is NULL. |
num.rf.const |
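Data frame with constraints for numeric risk factors. It has to contain the following columns: rf (name of the risk factor), lower (lower bound), upper (upper bound), and type ("numeric" or "integer"), as used in the Examples section. Default value is NULL. |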
k |
Number of nearest neighbors. Default value is 5. |
seed |
Random seed used to ensure reproducible results. Default is 81000. |
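As a rough illustration of how a synthetic value for a numeric risk factor could be produced and then forced to respect constraints of the kind supplied in num.rf.const, the sketch below uses the standard SMOTE interpolation between an observation and one of its nearest neighbors. The helper `synth_value` is hypothetical and not part of PDtoolkit; in particular, clipping to the bounds and rounding for integer risk factors is an assumption about how such constraints might be applied, not a description of the package internals.

```r
# Hypothetical helper (not part of PDtoolkit): generate one synthetic value
# for a numeric risk factor from an observation `x` and a neighbor `nb`.
synth_value <- function(x, nb, lower, upper, type = c("numeric", "integer")) {
  type <- match.arg(type)
  u <- runif(1)                    # random position between x and its neighbor
  v <- x + u * (nb - x)            # standard SMOTE interpolation
  v <- min(max(v, lower), upper)   # assumed handling: clip to [lower, upper]
  if (type == "integer") v <- round(v)  # assumed handling: keep integers integer
  v
}

set.seed(81000)  # fixed seed so that repeated runs give the same draws

# duration in months: integer risk factor bounded by 4 and 72
dur <- synth_value(x = 12, nb = 24, lower = 4, upper = 72, type = "integer")

# credit amount: the neighbor lies above the upper bound, so the result is capped
amt <- synth_value(x = 19000, nb = 26000, lower = 250, upper = 20000)
```

Fixing the seed before the random draws is what makes the set of synthetic observations reproducible from one run to the next.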
Value
The command smote returns a data frame with added synthetic observations for the selected minority class.
The data frame contains all variables from the db data frame plus an additional variable (smote) that serves as
an indicator for distinguishing between original and synthetic observations.
Examples
suppressMessages(library(PDtoolkit))
data(loans)
#check numeric variables (note that one of them is the target, not a risk factor)
names(loans)[sapply(loans, is.numeric)]
#define constraints of numeric risk factors
num.rf.const <- data.frame(rf = c("Duration of Credit (month)", "Credit Amount", "Age (years)"),
                           lower = c(4, 250, 19),
                           upper = c(72, 20000, 75),
                           type = c("integer", "numeric", "integer"))
num.rf.const
#loans$"Account Balance"[990:1000] <- NA
#loans$"Credit Amount"[900:920] <- NA
loans.s <- smote(db = loans,
                 target = "Creditability",
                 minority.class = 1,
                 osr = 0.05,
                 ordinal.rf = NULL,
                 num.rf.const = num.rf.const,
                 k = 5,
                 seed = 81000)
str(loans.s)
table(loans.s$Creditability, loans.s$smote)
#select minority class
loans.mc <- loans.s[loans.s$Creditability%in%1, ]
head(loans.mc)