smote {PDtoolkit}    R Documentation

Synthetic Minority Oversampling Technique (SMOTE)

Description

smote performs a type of data augmentation for the selected (usually minority) class. To process continuous and categorical risk factors simultaneously, the Heterogeneous Euclidean-Overlap Metric (HEOM) is used in the nearest neighbors algorithm.
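To illustrate how a mixed-type distance of this kind works, the following is a minimal sketch of the HEOM definition as it is usually stated: a range-normalized absolute difference for numeric variables, a 0/1 overlap for categorical variables, the maximal distance of 1 when a value is missing, and the square root of the summed squares. The helper heom_dist and the toy data are hypothetical and are not part of PDtoolkit; the package's internal implementation may differ.

```r
# Hypothetical helper, not part of PDtoolkit: HEOM distance between two
# observations given as one-row data frames with identical columns.
heom_dist <- function(x, y, ranges) {
  d <- vapply(names(x), function(v) {
    a <- x[[v]]
    b <- y[[v]]
    if (is.na(a) || is.na(b)) {
      1                                          # missing value: maximal distance
    } else if (is.numeric(a)) {
      # range-normalized absolute difference (0 if the variable is constant)
      if (ranges[[v]] == 0) 0 else abs(a - b) / ranges[[v]]
    } else {
      as.numeric(as.character(a) != as.character(b))  # overlap: 0 if equal, else 1
    }
  }, numeric(1))
  sqrt(sum(d^2))
}

# Toy data (assumed for illustration only)
toy <- data.frame(age = c(25, 45, 65),
                  purpose = c("car", "car", "house"),
                  stringsAsFactors = FALSE)
# Range (max - min) of each numeric column, computed on the full data set
ranges <- list(age = diff(range(toy$age)))

d12 <- heom_dist(toy[1, ], toy[2, ], ranges)  # same purpose, age differs by half the range
d13 <- heom_dist(toy[1, ], toy[3, ], ranges)  # different purpose, age differs by the full range
```

Because every per-variable distance lies between 0 and 1, numeric and categorical risk factors contribute on a comparable scale.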

Usage

smote(
  db,
  target,
  minority.class,
  osr,
  ordinal.rf = NULL,
  num.rf.const = NULL,
  k = 5,
  seed = 81000
)

Arguments

db

Data set of risk factors and target variable.

target

Name of target variable within db argument.

minority.class

Value of the minority class. It can be a numeric or character value, but it has to exist in the target variable.

osr

Oversampling rate. It has to be a numeric value greater than 0 (for example, 0.2 for 20% oversampling).

ordinal.rf

Character vector of ordinal risk factors. Default value is NULL.

num.rf.const

Data frame with constraints for numeric risk factors. It has to contain the following columns: rf (numeric risk factor names from db), lower (lower bound of the numeric risk factor), upper (upper bound of the numeric risk factor), and type (type of the numeric risk factor, either "numeric" or "integer"). Constraints are used to correct synthetic data for the selected numeric risk factors. Default value is NULL, which means that no corrections are applied.
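As a rough illustration of what such a correction could look like, the sketch below clamps synthetic values to the [lower, upper] interval and rounds the variables declared as "integer". This is an assumption about the kind of correction meant here, not the package's actual code; the helper apply_const and the column names duration and amount are hypothetical.

```r
# Hypothetical helper, not PDtoolkit code: clamp synthetic values to
# [lower, upper] and round the variables declared as "integer".
apply_const <- function(synth, const) {
  for (i in seq_len(nrow(const))) {
    v <- const$rf[i]
    x <- pmin(pmax(synth[[v]], const$lower[i]), const$upper[i])
    if (const$type[i] == "integer") x <- round(x)
    synth[[v]] <- x
  }
  synth
}

# Constraint table in the layout described above (toy names and bounds)
const <- data.frame(rf = c("duration", "amount"),
                    lower = c(4, 250),
                    upper = c(72, 20000),
                    type = c("integer", "numeric"),
                    stringsAsFactors = FALSE)

# Toy synthetic values, some outside the bounds and some non-integer
synth <- data.frame(duration = c(2.3, 18.6, 80.1),
                    amount = c(100.5, 3200.75, 25000))

fixed <- apply_const(synth, const)
fixed$duration  # 4 19 72
fixed$amount    # 250.00 3200.75 20000.00
```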

k

Number of nearest neighbors. Default value is 5.
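The role of k can be sketched with the standard SMOTE step for numeric variables: pick a minority observation, find its k nearest neighbors within the minority class, choose one of them at random, and place a synthetic point at a random position on the segment between the two. The sketch below uses plain Euclidean distance on toy numeric data as a simplification (the function itself uses HEOM), and all object names are hypothetical.

```r
set.seed(81000)

# Toy minority-class observations with two numeric risk factors
minority <- matrix(c(1, 2,
                     2, 3,
                     3, 3,
                     8, 9),
                   ncol = 2, byrow = TRUE)
k <- 2

# Distances from observation i to all minority observations
i <- 1
d <- sqrt(rowSums(sweep(minority, 2, minority[i, ])^2))
d[i] <- Inf                      # exclude the observation itself
nn <- order(d)[seq_len(k)]       # indices of the k nearest neighbors

# Choose one neighbor at random and interpolate toward it
j <- nn[sample.int(k, 1)]
u <- runif(1)                    # random position on the segment
synthetic <- minority[i, ] + u * (minority[j, ] - minority[i, ])
```

A larger k lets synthetic points spread toward more distant minority observations, while a smaller k keeps them close to the original observation.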

seed

Random seed used to ensure reproducible results. Default is 81000.

Value

The command smote returns a data frame with added synthetic observations for the selected minority class. The data frame contains all variables from the db data frame plus an additional variable (smote) that serves as an indicator for distinguishing between original and synthetic observations.

Examples

suppressMessages(library(PDtoolkit))
data(loans)
#check numeric variables (note that one of the variables is the target, not a risk factor)
names(loans)[sapply(loans, is.numeric)]
#define constraints for numeric risk factors
num.rf.const <- data.frame(rf = c("Duration of Credit (month)", "Credit Amount", "Age (years)"),
			   lower = c(4, 250, 19),
			   upper = c(72, 20000, 75),
			   type = c("integer", "numeric", "integer"))
num.rf.const

#loans$"Account Balance"[990:1000] <- NA
#loans$"Credit Amount"[900:920] <- NA

loans.s <- smote(db = loans,
	     target = "Creditability",
	     minority.class = 1,  
	     osr = 0.05,
	     ordinal.rf = NULL, 
	     num.rf.const = num.rf.const, 
	     k = 5, 
	     seed = 81000)
str(loans.s)
table(loans.s$Creditability, loans.s$smote)
#select minority class
loans.mc <- loans.s[loans.s$Creditability %in% 1, ]
head(loans.mc)

[Package PDtoolkit version 1.2.0 Index]