ndr.bin {monobin}

R Documentation

Four-stage monotonic binning procedure including regression with nested dummies

Description

ndr.bin extends the three-stage monotonic binning procedure (iso.bin) with a fourth stage: regression with nested dummies. The first stage uses isotonic regression to achieve monotonicity. The next two stages apply corrections, where needed, for the minimum percentage of observations and the minimum target rate. The final regression stage identifies statistically significant cut points.
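The idea behind the nested-dummy regression can be sketched in base R. This is an illustration on synthetic data with hypothetical cut points (`cuts`), not the package's internal code. Each dummy equals 1 when `x` is at or above the lower bound of a bin, so each coefficient measures the change in the mean of `y` relative to the previous bin. A cut point whose coefficient is not significant at the chosen threshold would then be a natural candidate for removal.

```r
# Sketch only: synthetic data and assumed cut points, not monobin internals.
set.seed(1)
x <- runif(300, 0, 30)
y <- 0.1 + 0.2 * (x >= 10) + 0.3 * (x >= 20) + rnorm(300, sd = 0.05)

cuts <- c(10, 20)  # assumed lower bounds of bins 2 and 3

# Nested dummies: dv_k is 1 for every observation at or above the k-th bin's lower bound.
nd <- sapply(cuts, function(k) as.integer(x >= k))
colnames(nd) <- paste0("dv_", seq_along(cuts) + 1)
db <- data.frame(y = y, nd)

fit <- lm(y ~ dv_2 + dv_3, data = db)
summary(fit)$coefficients[, c("Estimate", "Pr(>|t|)")]

# Cumulative sums of the coefficients reproduce the bin means of y.
bin <- cut(x, breaks = c(-Inf, cuts, Inf), right = FALSE)
cbind(cum.coef = cumsum(coef(fit)), bin.mean = tapply(y, bin, mean))
```

Because the dummies are nested, the intercept is the mean of the first bin and each later coefficient is the step from one bin to the next.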

Usage

ndr.bin(
  x,
  y,
  sc = c(NA, NaN, Inf, -Inf),
  sc.method = "together",
  y.type = NA,
  min.pct.obs = 0.05,
  min.avg.rate = 0.01,
  p.val = 0.05,
  force.trend = NA
)

Arguments

`x`	Numeric vector to be binned.
`y`	Numeric target vector (binary or continuous).
`sc`	Numeric vector of special case elements. Default values are `c(NA, NaN, Inf, -Inf)`. It is recommended to always keep the default values and add new ones if needed. If any of these values exist in `x` but are not defined in the `sc` vector, the function will report an error.
`sc.method`	Defines how special cases are treated: all together or separately. Possible values are `"together"` and `"separately"`.
`y.type`	Type of `y`. Possible options are `"bina"` (binary) and `"cont"` (continuous). If the default value (`NA`) is passed, the algorithm identifies whether `y` is a 0/1 or a continuous variable.
`min.pct.obs`	Minimum percentage of observations per bin. Default is 0.05 or 30 observations.
`min.avg.rate`	Minimum `y` average rate. Default is 0.01.
`p.val`	Threshold for p-value of regression coefficients. Default is 0.05. For a binary target binary logistic regression is estimated, whereas for a continuous target, linear regression is used.
`force.trend`	Whether the expected trend should be forced. Possible values: `"i"` for increasing trend (`y` increases as `x` increases), `"d"` for decreasing trend (`y` decreases as `x` increases). Default value is `NA`. If the default value is passed, the trend is identified from the sign of the Spearman correlation coefficient between `x` and `y` on complete cases.
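The default trend detection can be approximated in base R as follows. This is a minimal sketch of the documented behaviour, not the package source: the toy vectors, the variable names `complete`, `rho` and `trend`, and the treatment of a zero correlation as increasing are all assumptions.

```r
# Sketch only: mirrors the documented behaviour, not the package source.
x <- c(21, 35, NA, 48, 52, Inf, 60, 67)
y <- c(0, 0, 1, 0, 1, 1, 1, 1)
sc <- c(NA, NaN, Inf, -Inf)

# Complete cases are the observations that are not special cases.
# %in% matches NA and NaN as well as Inf and -Inf.
complete <- !(x %in% sc)

# Sign of the Spearman correlation on complete cases decides the trend.
rho <- cor(x[complete], y[complete], method = "spearman")
trend <- ifelse(rho >= 0, "i", "d")  # tie-breaking at zero is an assumption
trend
```

Passing `force.trend = "i"` or `force.trend = "d"` skips this detection and imposes the chosen direction.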

Value

ndr.bin returns a list of two objects. The first, the data frame summary.tbl, contains a summary table of the final binning; the second, x.trans, is a vector of discretized values. If x or y has a single unique value among the complete cases (cases other than special cases), the function returns a data frame with info.

Examples

suppressMessages(library(monobin))
suppressMessages(library(dplyr))  # provides %>%, group_by() and summarise() used below
data(gcd)
age.bin <- ndr.bin(x = gcd$age, y = gcd$qual)
age.bin[[1]]
table(age.bin[[2]])
#linear regression example
amount.bin <- ndr.bin(x = gcd$amount, y = gcd$qual, y.type = "cont", p.val = 0.05)
#create nested dummies
db.reg <- gcd[, c("qual", "amount")]
db.reg$amount.bin <- amount.bin[[2]]
amt.s <- db.reg %>%
  group_by(amount.bin) %>%
  summarise(qual.mean = mean(qual),
            amt.min = min(amount))
mins <- amt.s$amt.min
for (i in 2:length(mins)) {
  level.l <- mins[i]
  nd <- ifelse(db.reg$amount < level.l, 0, 1)
  db.reg <- cbind.data.frame(db.reg, nd)
  names(db.reg)[ncol(db.reg)] <- paste0("dv_", i)
}
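#the formula uses dv_2 and dv_3, i.e. it assumes the binning above produced three bins
reg.f <- "qual ~ dv_2 + dv_3"
lrm <- lm(as.formula(reg.f), data = db.reg)
lr.coef <- data.frame(summary(lrm)$coefficients)
lr.coef
#cumulative sums of the coefficients should match the bin means of qual
cumsum(lr.coef$Estimate)
#check: differences between bin means should match the dummy coefficients
as.data.frame(amt.s)
diff(amt.s$qual.mean)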