binsqreg {binsreg}    R Documentation

Data-Driven Binscatter Quantile Regression with Robust Inference Procedures and Plots

Description

binsqreg implements binscatter quantile regression with robust inference procedures and plots, following the results in Cattaneo, Crump, Farrell and Feng (2024a) and Cattaneo, Crump, Farrell and Feng (2024b). Binscatter provides a flexible way to describe the quantile relationship between two variables, after possibly adjusting for other covariates, based on partitioning/binning of the independent variable of interest. The main purpose of this function is to generate binned scatter plots with curve estimates, robust pointwise confidence intervals, and a uniform confidence band. If the binning scheme is not set by the user, the companion function binsregselect is used to implement binscatter in a data-driven way. Hypothesis testing about the function of interest can be conducted via the companion function binstest.

Usage

binsqreg(y, x, w = NULL, data = NULL, at = NULL, quantile = 0.5,
  deriv = 0, dots = NULL, dotsgrid = 0, dotsgridmean = T,
  line = NULL, linegrid = 20, ci = NULL, cigrid = 0, cigridmean = T,
  cb = NULL, cbgrid = 20, polyreg = NULL, polyreggrid = 20,
  polyregcigrid = 0, by = NULL, bycolors = NULL, bysymbols = NULL,
  bylpatterns = NULL, legendTitle = NULL, legendoff = F, nbins = NULL,
  binspos = "qs", binsmethod = "dpi", nbinsrot = NULL, pselect = NULL,
  sselect = NULL, samebinsby = F, randcut = NULL, nsims = 500,
  simsgrid = 20, simsseed = NULL, vce = "nid", cluster = NULL,
  asyvar = F, level = 95, noplot = F, dfcheck = c(20, 30),
  masspoints = "on", weights = NULL, subset = NULL, plotxrange = NULL,
  plotyrange = NULL, qregopt = NULL, ...)

Arguments

y

outcome variable. A vector.

x

independent variable of interest. A vector.

w

control variables. A matrix, a vector or a formula.

data

an optional data frame containing variables in the model.

at

value of w at which the estimated function is evaluated. The default is at="mean", which corresponds to the mean of w. Other options are: at="median" for the median of w, at="zero" for a vector of zeros. at can also be a vector of the same length as the number of columns of w (if w is a matrix) or a data frame containing the same variables as specified in w (when data is specified). Note that when at="mean" or at="median", all factor variables (if specified) are excluded from the evaluation (set as zero).

quantile

the quantile to be estimated. A number strictly between 0 and 1.
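As an illustration of the quantile argument, the sketch below fits the 0.25 and 0.75 conditional quantiles on simulated data and takes their difference bin by bin. The simulated data, the seed, and the choice nbins=10 are assumptions made for this example. The access path data.plot[[1]]$data.dots follows the Value section, and the sketch assumes that data.plot is still returned when noplot=TRUE.

```r
library(binsreg)

set.seed(2024)
n <- 500
x <- runif(n)
y <- sin(x) + (0.5 + x) * rnorm(n)  # noise spread grows with x (simulated)

# Fixing nbins keeps the bins identical across the two fits;
# noplot = TRUE skips drawing, since only the estimates are needed here
q25 <- binsqreg(y, x, quantile = 0.25, nbins = 10, noplot = TRUE)
q75 <- binsqreg(y, x, quantile = 0.75, nbins = 10, noplot = TRUE)

# Dots for the first (and only) group, as described under Value
d25 <- q25$data.plot[[1]]$data.dots
d75 <- q75$data.plot[[1]]$data.dots

# Bin-by-bin interquartile range of y given x
spread <- data.frame(x = d25$x, iqr = d75$fit - d25$fit)
spread
```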

deriv

derivative order of the regression function for estimation, testing and plotting. The default is deriv=0, which corresponds to the function itself.

dots

a vector or a logical value. If dots=c(p,s), a piecewise polynomial of degree p with s smoothness constraints is used for point estimation and plotting as "dots". The default is dots=c(0,0), which corresponds to piecewise constant (canonical binscatter). If dots=T, the default dots=c(0,0) is used unless the degree p or smoothness s selection is requested via the option pselect or sselect (see more details in the explanation of pselect and sselect). If dots=F is specified, the dots are not included in the plot.
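To make the canonical case dots=c(0,0) concrete, here is a hand-rolled sketch in base R: with a piecewise constant fit and no control variables, each dot is essentially a within-bin sample quantile of y, placed at the within-bin mean of x. This is an approximation for intuition only, not the package's implementation; the number of bins J, the quantile level tau, and the simulated data are assumptions.

```r
set.seed(1)
n <- 500
x <- runif(n)
y <- sin(x) + rnorm(n)

tau <- 0.5  # quantile level (assumed)
J   <- 10   # number of bins, fixed by hand in this sketch

# Quantile-spaced knots: J bins with roughly equal numbers of observations
knots <- quantile(x, probs = seq(0, 1, length.out = J + 1), names = FALSE)
bin   <- cut(x, breaks = knots, include.lowest = TRUE, labels = FALSE)

# One dot per bin: mean of x on the horizontal axis,
# sample tau-quantile of y on the vertical axis
dots <- data.frame(
  bin = seq_len(J),
  x   = as.numeric(tapply(x, bin, mean)),
  fit = as.numeric(tapply(y, bin, quantile, probs = tau))
)
dots
```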

dotsgrid

number of dots within each bin to be plotted. Given the choice, these dots are point estimates evaluated over an evenly-spaced grid within each bin. The default is dotsgrid=0, and only the point estimates at the mean of x within each bin are presented.

dotsgridmean

If true, the dots corresponding to the point estimates evaluated at the mean of x within each bin are presented. By default, they are presented, i.e., dotsgridmean=T.

line

a vector or a logical value. If line=c(p,s), a piecewise polynomial of degree p with s smoothness constraints is used for plotting as a "line". If line=T is specified, line=c(0,0) is used unless the degree p or smoothness s selection is requested via the option pselect or sselect (see more details in the explanation of pselect and sselect). If line=F or line=NULL is specified, the line is not included in the plot. The default is line=NULL.

linegrid

number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the line=c(p,s) option. The default is linegrid=20, which corresponds to 20 evenly-spaced evaluation points within each bin for fitting/plotting the line.
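The idea of an evenly-spaced grid within each bin can be sketched in base R as follows. The knot positions are hypothetical, and the sketch assumes that the grid points lie strictly inside each bin; the package's exact construction may differ (for example, the Value section shows that knots are flagged separately via isknot).

```r
# Hypothetical bin boundaries: three bins on [0, 1]
knots <- c(0, 0.25, 0.5, 1)
ngrid <- 4  # plays the role of linegrid, kept small for display

# Assumption: ngrid evenly spaced points strictly inside each bin,
# so the bin boundaries themselves are excluded
grid_in_bin <- function(left, right, ngrid) {
  seq(left, right, length.out = ngrid + 2)[2:(ngrid + 1)]
}

left  <- head(knots, -1)
right <- tail(knots, -1)

grid <- data.frame(
  bin = rep(seq_along(left), each = ngrid),
  x   = unlist(Map(grid_in_bin, left, right, ngrid))
)
grid
```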

ci

a vector or a logical value. If ci=c(p,s) a piecewise polynomial of degree p with s smoothness constraints is used for constructing confidence intervals. If ci=T is specified, ci=c(1,1) is used unless the degree p or smoothness s selection is requested via the option pselect or sselect (see more details in the explanation of pselect and sselect). If ci=F or ci=NULL is specified, the confidence intervals are not included in the plot. The default is ci=NULL.

cigrid

number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the ci=c(p,s) option. The default is cigrid=0, as shown in Usage, in which case only the confidence intervals at the mean of x within each bin are presented (see cigridmean).

cigridmean

If true, the confidence intervals corresponding to the point estimates evaluated at the mean of x within each bin are presented. The default is cigridmean=T.

cb

a vector or a logical value. If cb=c(p,s), a piecewise polynomial of degree p with s smoothness constraints is used for constructing the confidence band. If the option cb=T is specified, cb=c(1,1) is used unless the degree p or smoothness s selection is requested via the option pselect or sselect (see more details in the explanation of pselect and sselect). If cb=F or cb=NULL is specified, the confidence band is not included in the plot. The default is cb=NULL.

cbgrid

number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the cb=c(p,s) option. The default is cbgrid=20, which corresponds to 20 evenly-spaced evaluation points within each bin for confidence band construction.

polyreg

degree of a global polynomial regression model for plotting. By default, this fit is not included in the plot unless explicitly specified. Recommended specification is polyreg=3, which adds a cubic (global) polynomial fit of the regression function of interest to the binned scatter plot.

polyreggrid

number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the polyreg=p option. The default is polyreggrid=20, which corresponds to 20 evenly-spaced evaluation points within each bin for plotting the global polynomial fit.

polyregcigrid

number of evaluation points of an evenly-spaced grid within each bin used for constructing confidence intervals based on polynomial regression set by the polyreg=p option. The default is polyregcigrid=0, which corresponds to not plotting confidence intervals for the global polynomial regression approximation.

by

a vector containing the group indicator for subgroup analysis; both numeric and string variables are supported. When by is specified, binsqreg implements estimation and inference for each subgroup separately, but produces a common binned scatter plot. By default, the binning structure is selected for each subgroup separately, but see the option samebinsby below for imposing a common binning structure across subgroups.

bycolors

an ordered list of colors for plotting each subgroup series defined by the option by.

bysymbols

an ordered list of symbols for plotting each subgroup series defined by the option by.

bylpatterns

an ordered list of line patterns for plotting each subgroup series defined by the option by.

legendTitle

String, title of legend.

legendoff

If true, no legend is added.

nbins

number of bins for partitioning/binning of x. If nbins=T or nbins=NULL (default) is specified, the number of bins is selected via the companion command binsregselect in a data-driven, optimal way whenever possible. If a vector with more than one number is specified, the number of bins is selected within this vector via the companion command binsregselect.

binspos

position of binning knots. The default is binspos="qs", which corresponds to quantile-spaced binning (canonical binscatter). The other options are "es" for evenly-spaced binning, or a vector for manual specification of the positions of inner knots (which must be within the range of x).
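The difference between quantile-spaced and evenly-spaced binning can be seen in a small base R sketch. It does not call the package; the skewed simulated x, the seed, and J=5 are assumptions, and the package's own knot construction may differ in details such as the quantile definition used.

```r
set.seed(42)
x <- rexp(1000)  # right-skewed, so the two schemes differ visibly
J <- 5           # number of bins (assumed)

# Quantile-spaced knots ("qs"): equal share of observations per bin
knots_qs <- quantile(x, probs = seq(0, 1, length.out = J + 1), names = FALSE)

# Evenly-spaced knots ("es"): equal bin width over the range of x
knots_es <- seq(min(x), max(x), length.out = J + 1)

counts_qs <- as.vector(table(cut(x, knots_qs, include.lowest = TRUE)))
counts_es <- as.vector(table(cut(x, knots_es, include.lowest = TRUE)))

# Observations per bin under each scheme
rbind(quantile_spaced = counts_qs, evenly_spaced = counts_es)

# For manual specification only the inner knots are supplied,
# for example the J - 1 interior quantile-spaced knots
inner_knots <- knots_qs[2:J]
```

With skewed data, evenly-spaced bins in the tail hold few observations, which is one reason quantile spacing is the canonical choice.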

binsmethod

method for data-driven selection of the number of bins. The default is binsmethod="dpi", which corresponds to the IMSE-optimal direct plug-in rule. The other option is: "rot" for rule of thumb implementation.

nbinsrot

initial number of bins value used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.

pselect

vector of numbers within which the degree of polynomial p for point estimation is selected. Piecewise polynomials of the selected optimal degree p are used to construct dots or line if dots=T or line=T is specified, whereas piecewise polynomials of degree p+1 are used to construct confidence intervals or confidence band if ci=T or cb=T is specified. Note: To implement the degree or smoothness selection, in addition to pselect or sselect, nbins=# must be specified.

sselect

vector of numbers within which the number of smoothness constraints s for point estimation is selected. Piecewise polynomials with the selected optimal s smoothness constraints are used to construct dots or line if dots=T or line=T is specified, whereas piecewise polynomials with s+1 constraints are used to construct confidence intervals or confidence band if ci=T or cb=T is specified. If not specified, for each value p supplied in the option pselect, only the piecewise polynomial with the maximum smoothness is considered, i.e., s=p.

samebinsby

if true, a common partitioning/binning structure across all subgroups specified by the option by is forced. The knots positions are selected according to the option binspos and using the full sample. If nbins is not specified, then the number of bins is selected via the companion command binsregselect and using the full sample.
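A possible use of by together with samebinsby is sketched below. The two simulated groups, the seed, and nbins=10 are assumptions; the element names data.plot, data.bin, and opt$nbins.by are taken from the Value section, and it is assumed that they are returned when noplot=TRUE.

```r
library(binsreg)

set.seed(11)
n <- 1000
x <- runif(n)
g <- rep(c("A", "B"), each = n / 2)  # hypothetical group labels
y <- ifelse(g == "A", sin(x), cos(x)) + rnorm(n)

# Median regression per group, forcing one common set of bins
est <- binsqreg(y, x, by = g, samebinsby = TRUE, nbins = 10, noplot = TRUE)

# One sublist of plotting data per group
length(est$data.plot)

# Number of bins used for each group
est$opt$nbins.by

# With samebinsby = TRUE the bin endpoints should coincide across groups
all.equal(est$data.plot[[1]]$data.bin$left.endpoint,
          est$data.plot[[2]]$data.bin$left.endpoint)
```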

randcut

upper bound on a uniformly distributed variable used to draw a subsample for bins/degree/smoothness selection. Observations for which runif()<=# are used. # must be between 0 and 1. By default, max(5000, 0.01n) observations are used if the sample size n>5000.
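The subsampling rule runif()<=# can be mimicked in base R. The sketch assumes that the default target of max(5000, 0.01n) observations translates into the fraction max(5000, 0.01n)/n; that conversion is an interpretation of the text above, not taken from the package source.

```r
set.seed(123)
n <- 200000  # hypothetical sample size above 5000

# Assumed fraction implied by the default target subsample size
randcut <- max(5000, 0.01 * n) / n

# Keep an observation when its uniform draw falls at or below the cutoff
keep <- runif(n) <= randcut

randcut    # fraction of the sample retained on average
sum(keep)  # realized subsample size, random around randcut * n
```

Because each observation is kept independently, the realized subsample size varies from draw to draw around randcut * n.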

nsims

number of random draws for constructing confidence bands. The default is nsims=500, which corresponds to 500 draws from a standard Gaussian random vector of size [(p+1)*J - (J-1)*s]. Setting at least nsims=2000 is recommended to obtain the final results.
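The size (p+1)*J - (J-1)*s counts p+1 coefficients in each of the J bins, minus s smoothness constraints at each of the J-1 inner knots. A small base R sketch of that count, and of the matrix of Gaussian draws it implies, follows; the helper name basis_dim is hypothetical.

```r
# Number of free coefficients of a piecewise polynomial of degree p
# with s smoothness constraints on J bins (hypothetical helper)
basis_dim <- function(p, s, J) (p + 1) * J - (J - 1) * s

basis_dim(p = 0, s = 0, J = 10)  # piecewise constant: one coefficient per bin
basis_dim(p = 1, s = 1, J = 10)  # continuous piecewise linear
basis_dim(p = 2, s = 2, J = 10)  # quadratic with continuous first derivative

# nsims standard Gaussian vectors of that size, one per row
set.seed(1)
nsims <- 500
K <- basis_dim(p = 1, s = 1, J = 10)
draws <- matrix(rnorm(nsims * K), nrow = nsims, ncol = K)
dim(draws)
```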

simsgrid

number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum operation needed to construct confidence bands. The default is simsgrid=20, which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum operator. Setting at least simsgrid=50 is recommended to obtain the final results.
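A call using the recommended simulation settings for final results might look like the sketch below. The simulated data, the seed values, and nbins=10 are assumptions; cval.by is the element documented under Value, and it is assumed to be returned when noplot=TRUE.

```r
library(binsreg)

set.seed(99)
x <- runif(500)
y <- sin(x) + rnorm(500)

# Confidence band from a linear spline, with more draws and a finer
# grid than the defaults; simsseed makes the simulation reproducible
est <- binsqreg(y, x, nbins = 10, cb = c(1, 1),
                nsims = 2000, simsgrid = 50, simsseed = 1234,
                noplot = TRUE)

# Simulated critical value used for the band (one entry per group)
est$cval.by
```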

simsseed

seed for simulation.

vce

Procedure to compute the variance-covariance matrix estimator (see summary.rq for more details). Options are

  • "iid" which presumes that the errors are iid and computes an estimate of the asymptotic covariance matrix as in KB(1978).

  • "nid" which presumes local (in quantile) linearity of the conditional quantile functions and computes a Huber sandwich estimate using a local estimate of the sparsity.

  • "ker" which uses a kernel estimate of the sandwich as proposed by Powell (1991).

  • "boot" which implements one of several possible bootstrapping alternatives for estimating standard errors including a variant of the wild bootstrap for clustered response. See boot.rq for further details.

cluster

cluster ID. Used to compute cluster-robust standard errors.

asyvar

if true, the standard error of the nonparametric component is computed and the uncertainty related to control variables is omitted. Default is asyvar=FALSE, that is, the uncertainty related to control variables is taken into account.

level

nominal confidence level for confidence interval and confidence band estimation. Default is level=95.
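Since level is given in percent, it has to be converted to a tail probability before a critical value is looked up. The base R sketch below shows that conversion for a pointwise interval, assuming a two-sided interval built from a standard normal critical value; the fit and standard error are made-up numbers, and the confidence band uses a simulated critical value instead (see cval.by under Value).

```r
level <- 95  # nominal level in percent, as in the default

# Two-sided interval: split the remaining probability over both tails
tail_prob <- (1 - level / 100) / 2
z <- qnorm(1 - tail_prob)

# Hypothetical point estimate and standard error at one evaluation point
fit <- 0.40
se  <- 0.10

ci <- c(lower = fit - z * se, upper = fit + z * se)
ci
```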

noplot

if true, no plot is produced.

dfcheck

adjustments for minimum effective sample size checks, which take into account number of unique values of x (i.e., number of mass points), number of clusters, and degrees of freedom of the different statistical models considered. The default is dfcheck=c(20, 30). See Cattaneo, Crump, Farrell and Feng (2024c) for more details.

masspoints

how mass points in x are handled. Available options:

  • "on" all mass point and degrees of freedom checks are implemented. Default.

  • "noadjust" mass point checks and the corresponding effective sample size adjustments are omitted.

  • "nolocalcheck" within-bin mass point and degrees of freedom checks are omitted.

  • "off" "noadjust" and "nolocalcheck" are set simultaneously.

  • "veryfew" forces the function to proceed as if x has only a few mass points (i.e., distinct values). In other words, it forces the function to proceed as if the mass point and degrees of freedom checks had failed.

weights

an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. For more details, see lm.

subset

optional rule specifying a subset of observations to be used.

plotxrange

a vector. plotxrange=c(min, max) specifies a range of the x-axis for plotting. Observations outside the range are dropped in the plot.

plotyrange

a vector. plotyrange=c(min, max) specifies a range of the y-axis for plotting. Observations outside the range are dropped in the plot.

qregopt

a list of optional arguments used by rq.

...

optional arguments to control bootstrapping. See boot.rq.

Value

bins_plot

A ggplot object for binscatter plot.

data.plot

A list containing data for plotting. Each item is a sublist of data frames for each group. Each sublist may contain the following data frames:

  • data.dots Data for dots. It contains: x, evaluation points; bin, the indicator of bins; isknot, indicator of inner knots; mid, midpoint of each bin; and fit, fitted values.

  • data.line Data for line. It contains: x, evaluation points; bin, the indicator of bins; isknot, indicator of inner knots; mid, midpoint of each bin; and fit, fitted values.

  • data.ci Data for CI. It contains: x, evaluation points; bin, the indicator of bins; isknot, indicator of inner knots; mid, midpoint of each bin; ci.l and ci.r, left and right boundaries of each confidence interval.

  • data.cb Data for CB. It contains: x, evaluation points; bin, the indicator of bins; isknot, indicator of inner knots; mid, midpoint of each bin; cb.l and cb.r, left and right boundaries of the confidence band.

  • data.poly Data for polynomial regression. It contains: x, evaluation points; bin, the indicator of bins; isknot, indicator of inner knots; mid, midpoint of each bin; and fit, fitted values.

  • data.polyci Data for confidence intervals based on polynomial regression. It contains: x, evaluation points; bin, the indicator of bins; isknot, indicator of inner knots; mid, midpoint of each bin; polyci.l and polyci.r, left and right boundaries of each confidence interval.

  • data.bin Data for the binning structure. It contains: bin.id, ID for each bin; left.endpoint and right.endpoint, left and right endpoints of each bin.
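The data frames listed above make it possible to rebuild or restyle the plot by hand. The sketch below draws dots and pointwise confidence intervals with ggplot2 from data.dots and data.ci. The simulated data, the seed, nbins=10, and the styling choices are assumptions, and it is assumed that data.plot is still returned when noplot=TRUE.

```r
library(binsreg)
library(ggplot2)

set.seed(7)
x <- runif(500)
y <- sin(x) + rnorm(500)

# Estimate without plotting; ci = c(1, 1) requests confidence intervals
est <- binsqreg(y, x, nbins = 10, ci = c(1, 1), noplot = TRUE)

# Plotting data for the first (and only) group
grp <- est$data.plot[[1]]

# Custom plot: error bars from data.ci, points from data.dots
p <- ggplot() +
  geom_errorbar(data = grp$data.ci,
                aes(x = x, ymin = ci.l, ymax = ci.r), width = 0.01) +
  geom_point(data = grp$data.dots, aes(x = x, y = fit)) +
  labs(x = "x", y = "conditional median of y") +
  theme_bw()
p
```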

imse.var.rot

Variance constant in IMSE, ROT selection.

imse.bsq.rot

Bias constant in IMSE, ROT selection.

imse.var.dpi

Variance constant in IMSE, DPI selection.

imse.bsq.dpi

Bias constant in IMSE, DPI selection.

cval.by

A vector of critical values for constructing confidence band for each group.

opt

A list containing options passed to the function, as well as N.by (total sample size for each group), Ndist.by (number of distinct values in x for each group), Nclust.by (number of clusters for each group), nbins.by (number of bins for each group), and byvals (number of distinct values in by). The degree and smoothness of polynomials for dots, line, confidence intervals and confidence band for each group are saved in dots, line, ci, and cb.

Author(s)

Matias D. Cattaneo, Princeton University, Princeton, NJ. cattaneo@princeton.edu.

Richard K. Crump, Federal Reserve Bank of New York, New York, NY. richard.crump@ny.frb.org.

Max H. Farrell, UC Santa Barbara, Santa Barbara, CA. mhfarrell@gmail.com.

Yingjie Feng (maintainer), Tsinghua University, Beijing, China. fengyingjiepku@gmail.com.

References

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024a: On Binscatter. American Economic Review 114(5): 1488-1514.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024b: Nonlinear Binscatter Methods. Working Paper.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024c: Binscatter Regressions. Working Paper.

See Also

binsregselect, binstest.

Examples

 x <- runif(500); y <- sin(x)+rnorm(500)
 ## Binned scatterplot
 binsqreg(y,x)

[Package binsreg version 1.1 Index]