splinebins {binsmooth}

R Documentation

Optimized spline PDF and CDF fitted to binned data

Description

Creates a smooth cubic spline CDF and piecewise-quadratic PDF based on a set of binned data (edges and counts).

Usage

splinebins(bEdges, bCounts, m = NULL,
           numIterations = 16, monoMethod = c("hyman", "monoH.FC"))

Arguments

`bEdges`	A vector `e_1, e_2, \ldots, e_n` giving the right endpoint of each bin. The value in `e_n` is ignored and assumed to be `Inf` or `NA`, indicating that the top bin is unbounded. The edges determine `n` bins on the intervals `e_{i-1} \le x \le e_i`, where `e_0` is assumed to be 0.
`bCounts`	A vector `c_1, c_2, \ldots, c_n` giving the counts for each bin (i.e., the number of data elements in each bin). Assumed to be nonnegative.
`m`	An estimate for the mean of the distribution. If no value is supplied, the mean will be estimated by (temporarily) setting `e_n` equal to `2e_{n-1}`, and a warning message will be generated.
`numIterations`	The number of iterations performed by a binary search that optimizes the CDF to fit the mean.
`monoMethod`	The method for constructing a monotone spline. Must be one of `"hyman"` or `"monoH.FC"`. The `"hyman"` method tends to integrate faster and produce smoother density functions. See `splinefun` for more details.
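The role of `m` and `numIterations` can be illustrated with a minimal sketch in base R. It is not the package's code: it assumes the search is carried out over the upper endpoint of the top bin, and the toy data, the target mean, the bracket `[lo, hi]`, and the helper `mean_for` are all hypothetical. It relies on the fact that, for a nonnegative variable, the mean equals the integral of `1 - CDF`.

```r
# Toy binned data (hypothetical values); the top bin is unbounded
edges  <- c(10, 20, 30, NA)
counts <- c(5, 15, 20, 10)
target_mean <- 28                      # hypothetical estimate of the mean

n   <- length(counts)
cum <- cumsum(counts) / sum(counts)    # cumulative proportions at the edges

# Mean implied by a monotone spline CDF that reaches 1 at upper endpoint E
mean_for <- function(E) {
  cdf <- splinefun(c(0, edges[-n], E), c(0, cum), method = "hyman")
  integrate(function(t) 1 - cdf(t), 0, E)$value
}

lo <- edges[n - 1]          # shortest conceivable tail
hi <- 10 * edges[n - 1]     # assumed generous upper bracket
for (i in seq_len(16)) {    # 16 mirrors the default numIterations
  mid <- (lo + hi) / 2
  if (mean_for(mid) < target_mean) lo <- mid else hi <- mid
}
E <- (lo + hi) / 2
mean_for(E)                 # compare with target_mean
```

Each iteration halves the bracket, so more iterations give a fitted mean closer to the target at the cost of extra numerical integrations.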

Details

Fits a monotone cubic spline to the points specified by the binned data (the bin edges paired with the cumulative proportion of counts up to each edge) to produce a smooth cumulative distribution function. The PDF is then obtained by differentiating the CDF, so it is piecewise quadratic and preserves the area of each bin.
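The construction described above can be sketched with base R's `splinefun`. This is an illustration under simplifying assumptions, not the package's implementation: the toy data are hypothetical, and the top bin is given a fixed finite upper edge instead of one chosen to match a mean.

```r
# Toy binned data (hypothetical values); the last edge is an assumed finite cap
edges  <- c(10, 20, 30, 50)
counts <- c(5, 15, 20, 10)

x <- c(0, edges)                           # knots, with e_0 = 0
y <- c(0, cumsum(counts) / sum(counts))    # cumulative proportions at the knots

cdf <- splinefun(x, y, method = "hyman")   # monotone cubic interpolant
pdf <- function(t) cdf(t, deriv = 1)       # derivative: piecewise quadratic

# The area under the PDF over each bin equals that bin's share of the counts
bin_area <- diff(cdf(x))
all.equal(bin_area, counts / sum(counts))
integrate(pdf, 10, 20)$value               # area of the second bin
```

Because the spline interpolates the cumulative proportions exactly at the knots, bin areas are preserved no matter how the curve bends inside each bin.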

Value

Returns a list with the following components.

`splinePDF`	A piecewise-quadratic function giving the fitted PDF.
`splineCDF`	A piecewise-cubic function giving the CDF.
`E`	The right-hand endpoint of the support of the PDF.
`shrinkFactor`	If the supplied estimate for the mean is too small to be fitted with our method, the bin edges will be scaled by `shrinkFactor`, which will be chosen to be less than (and close to) 1.
`splineInvCDF`	An approximate inverse of `splineCDF`.
`fitWarn`	Flag set to `TRUE` if the fitted median falls in the wrong bin.
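As an illustration of what an approximate inverse CDF can look like, the following sketch tabulates a monotone spline CDF on a fine grid and interpolates the grid points as a function of the CDF values. This is one possible approach in base R, not necessarily how `splineInvCDF` is built; the knots, the grid size, and the names `cdf` and `inv_cdf` are hypothetical.

```r
# Hypothetical CDF knots: bin edges and cumulative proportions
x <- c(0, 10, 20, 30, 50)
y <- c(0, 0.1, 0.4, 0.8, 1)
cdf <- splinefun(x, y, method = "hyman")

# Tabulate the CDF on a fine grid, then swap the roles of x and y
grid    <- seq(0, 50, length.out = 1001)
inv_cdf <- approxfun(cdf(grid), grid, ties = mean)

med <- inv_cdf(0.5)   # approximate median
cdf(med)              # should be close to 0.5
```

A finer grid gives a more accurate inverse; a root finder such as `uniroot` applied to `cdf(t) - p` is an alternative when only a few quantiles are needed.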

Author(s)

David J. Hunter and McKalie Drown

References

Paul T. von Hippel, David J. Hunter, McKalie Drown. Better Estimates from Binned Income Data: Interpolated CDFs and Mean-Matching, Sociological Science, November 15, 2017. https://www.sociologicalscience.com/articles-v4-26-641/

Examples

# 2005 ACS data from Cook County, Illinois
binedges <- c(10000,15000,20000,25000,30000,35000,40000,45000,
              50000,60000,75000,100000,125000,150000,200000,NA)
bincounts <- c(157532,97369,102673,100888,90835,94191,87688,90481,
               79816,153581,195430,240948,155139,94527,92166,103217)
sb <- stepbins(binedges, bincounts, 76091)
splb <- splinebins(binedges, bincounts, 76091)

plot(splb$splinePDF, 0, 300000, n=500)
plot(sb$stepPDF, do.points=FALSE, col="gray", add=TRUE)
# notice that the curve preserves bin area

library(pracma)
integral(splb$splinePDF, 0, splb$E) # should be approximately 1
integral(function(x){1-splb$splineCDF(x)}, 0, splb$E) # should be close to the given mean
splb <- splinebins(binedges, bincounts, 76091, numIterations=20)
integral(function(x){1-splb$splineCDF(x)}, 0, splb$E) # closer to given mean

[Package binsmooth version 0.2.2 Index]