optbin {optbin}    R Documentation

Optimal Binning of Continuous Variables

Description

Determines break points in numeric data that minimize the squared difference between each point in a bin and the bin's average.

Usage

optbin(x, numbin, metric=c('se', 'mse'), is.sorted=FALSE, max.cache=2^31, na.rm=FALSE)

Arguments

x

numeric data

numbin

number of bins to partition vector into

metric

minimize squared error (se) between values and average over bin, or mean squared error (mse) dividing squared error by bin length

is.sorted

set to TRUE if x is already sorted in increasing order

max.cache

maximum memory in bytes to use to cache bin metrics; if the analysis would need more than this, a slower calculation without the cache is used

na.rm

drop NA values (which may occur when converting the data to a vector); if NA values remain, binning cannot proceed
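
The two choices for the metric argument can be illustrated with a short base R sketch. The helper names bin_se and bin_mse are hypothetical and not part of the package, and the sketch assumes that "bin length" means the number of values in the bin.

```r
# Hypothetical helpers, not part of optbin: cost of a single bin.
# Squared error: sum of squared deviations from the bin average.
bin_se <- function(v) sum((v - mean(v))^2)
# Mean squared error: squared error divided by the number of values
# (assumption: "bin length" is the count of values in the bin).
bin_mse <- function(v) bin_se(v) / length(v)

v <- c(1, 2, 3, 6)   # one bin with average 3
bin_se(v)            # deviations -2, -1, 0, 3 give 4 + 1 + 0 + 9 = 14
bin_mse(v)           # 14 / 4 = 3.5
```

Because mse divides by the bin size, it penalizes a large, loose bin less than se does, so the two metrics may select different break points on the same data.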

Details

Data is converted into a numeric vector and sorted if necessary. Internally, bins are determined by positions within the vector, with the breaks inclusive at the upper end. The bin thresholds follow the same convention, so bin b covers the range thr[b-1] < x <= thr[b], where thr[0] is -Inf. If more than one split achieves the best metric, the routine returns the first one found.
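
The threshold convention can be reproduced with base R's cut. This is a sketch only: the thr values below are made up to stand in for the thr component of a result, and the package's own assign.optbin (listed under See Also) is not used here.

```r
# Hypothetical thresholds standing in for the thr component of a result.
thr <- c(2, 5, 9)
x <- c(0.5, 2, 2.1, 5, 7, 9)

# right = TRUE closes each interval at its upper end, so bin b covers
# thr[b-1] < x <= thr[b]; the leading -Inf plays the role of thr[0].
bin <- cut(x, breaks = c(-Inf, thr), labels = FALSE, right = TRUE)
bin
```

A value equal to a threshold (2, 5, or 9 here) lands in the bin that the threshold closes, not in the next one. Values above the last threshold would come back as NA from cut; this should not arise for the binned data itself if the last threshold is the largest value.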

The library uses an exhaustive search over all possible breakpoints. It begins by finding the best two-bin splits for all pairs of start and end points, then adds a third bin, and so on. This rejects most alternatives at each level, leaving an O(nbin * nval * nval) algorithm, where nbin is the number of bins and nval is the number of values.
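
The level-by-level idea can be sketched as a small dynamic program in base R. This is not the package's implementation: it is a simplified variant that works on prefixes of the sorted data instead of all start and end pairs, it supports only the se metric, and the function name sketch_optbin and its return fields are hypothetical. It has the same O(nbin * nval * nval) growth.

```r
# Sketch only: prefix-based dynamic program, not optbin's internal code.
sketch_optbin <- function(x, numbin) {
  x <- sort(x)
  n <- length(x)
  stopifnot(numbin >= 1, numbin <= n)
  s1 <- c(0, cumsum(x))      # prefix sums of values
  s2 <- c(0, cumsum(x^2))    # prefix sums of squared values
  # Squared error of the bin covering sorted positions i..j (inclusive);
  # vectorized over i or j.
  se <- function(i, j) {
    m <- j - i + 1
    s <- s1[j + 1] - s1[i]
    (s2[j + 1] - s2[i]) - s * s / m
  }
  cost <- matrix(Inf, numbin, n)  # cost[b, j]: best SE for x[1..j] in b bins
  back <- matrix(0L, numbin, n)   # back[b, j]: end position of bin b - 1
  cost[1, ] <- se(1, seq_len(n))
  for (b in seq_len(numbin)[-1]) {          # add one bin per level
    for (j in b:n) {
      k <- (b - 1):(j - 1)                  # candidate ends of previous bin
      total <- cost[b - 1, k] + se(k + 1, j)
      best <- which.min(total)              # first minimum wins on ties
      cost[b, j] <- total[best]
      back[b, j] <- k[best]
    }
  }
  # Walk back from the last bin to recover each bin's end position.
  breaks <- integer(numbin)
  breaks[numbin] <- n
  for (b in rev(seq_len(numbin)[-1])) breaks[b - 1] <- back[b, breaks[b]]
  list(breaks = breaks, thr = x[breaks], minse = cost[numbin, n])
}

res <- sketch_optbin(c(1, 1.1, 0.9, 5, 5.2, 4.8, 10, 10.1), 3)
res$breaks   # end positions of the bins in the sorted data
res$thr      # values at those positions, the inclusive upper thresholds
```

Each level reuses the best costs from the level before, so only the cheapest way to reach each end position is kept and the rest are discarded. which.min returns the first minimum, which resembles the "first split found" rule described above, although the order in which the package visits candidates may differ.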

Value

An object of class 'optbin' with components:

x

the original data, sorted

numbins

the number of bins created

call

argument values when the function was called

metric

cost function used to select best partition

minse

value of SE/MSE metric for all bins

thr

upper threshold of bin range, inclusive

binavg

average of values in each bin

binse

value of SE/MSE metric for each bin

breaks

position of the (inclusive) endpoint of each bin in x

See Also

assign.optbin, print.optbin, summary.optbin, plot.optbin

Examples

## Well separated groups
set.seed(17)
d1 <- c(rnorm(75, mean=1, sd=0.2), rnorm(75, mean=3, sd=0.2),
        rnorm(84, mean=6, sd=0.2), rnorm(75, mean=9, sd=0.2),
        rnorm(75, mean=11, sd=0.2), rnorm(150, mean=15, sd=0.2))
## Divides into groups 1+2+3, 4+5, 6, metric is 1176.3
binned3 <- optbin(d1, 3)
summary(binned3)
plot(binned3)
## Divides into groups 1, 2, 3, 4+5, and 6, metric is 169.9
binned5 <- optbin(d1, 5)
plot(binned5)
## Divides into separate groups, metric is 24.4
binned6 <- optbin(d1, 6)
summary(binned6)
plot(binned6)
## Each rnorm group divides roughly in half.
binned12 <- optbin(d1, 12)
plot(binned12)
## A grouping that overlaps, bins near but not at minima between peaks
d2 <- c(rnorm(300, mean=1, sd=0.25), rnorm(400, mean=2, sd=0.25),
        rnorm(300, mean=3, sd=0.25))
binned3b <- optbin(d2, 3)
hist(binned3b, breaks=50, col='yellow')

[Package optbin version 1.3 Index]