birdie {birdie} | R Documentation |
Fit BIRDiE Models
Description
Fits one of three possible Bayesian Instrumental Regression for Disparity Estimation (BIRDiE) models to BISG probabilities and covariates. The simplest Categorical-Dirichlet model ([cat_dir()]) is appropriate when there are no covariates or when all covariates are discrete and fully interacted with one another. The more general Categorical mixed-effects model ([cat_mixed()]) supports any number of fixed effects and up to one random intercept. For continuous outcomes a Normal linear model is available ([gaussian()]).
Usage
birdie(
r_probs,
formula,
data,
family = cat_dir(),
prior = NULL,
weights = NULL,
algorithm = c("em", "gibbs", "em_boot"),
iter = 400,
warmup = 50,
prefix = "pr_",
ctrl = birdie.ctrl()
)
Arguments
r_probs |
A data frame or matrix of BISG probabilities, with one row per individual. The output of [bisg()] can be used directly here. |
formula |
A two-sided formula object describing the model structure. The left-hand side is the outcome variable, which must be discrete unless 'family=gaussian()' is used. A single random intercept term, denoted with a vertical bar ('"(1 | <term>)"'), is supported on the right-hand side. |
data |
An optional data frame containing the variables named in 'formula'. |
family |
A description of the complete-data model type to fit. Options are:
- [cat_dir()]: Categorical-Dirichlet model. All covariates must be fully interacted.
- [cat_mixed()]: Categorical mixed-effects model. Up to one random effect is supported.
- [gaussian()]: Linear model.
See the Details section below for more information on the various models. |
prior |
A list with entries specifying the model prior.
- For the 'cat_dir' model, the only entry is 'alpha', which should be a matrix of Dirichlet hyperparameters. The matrix should have one row for every level of the outcome variable and one column for every racial group. The default prior (used when 'prior=NULL') is an empirical Bayes prior equal to the weighted-mean estimate of the outcome-race table. A fully noninformative prior with all entries set to
The prior is stored after model fitting in the '$prior' element of the fitted model object. |
weights |
An optional numeric vector specifying likelihood weights. |
algorithm |
The inference algorithm to use. One of three options:
- '"em"': An expectation-maximization algorithm which performs inference for the maximum a posteriori (MAP) parameter values. Computationally efficient and supported by all the model families. Provides no uncertainty quantification.
- '"gibbs"': A Gibbs sampler for performing full Bayesian inference. Generally more computationally demanding than the EM algorithm, but provides uncertainty quantification. Currently supported for the 'cat_dir()' and 'gaussian()' model families. The computation-reliability tradeoff can be controlled with the 'iter' argument.
- '"em_boot"': A bootstrapped version of the EM algorithm. The number of bootstrap replicates is controlled by the 'iter' argument. Provides approximate uncertainty quantification. Currently supported for the 'cat_dir()' and 'gaussian()' model families. |
iter |
The number of post-warmup Gibbs samples, or the number of bootstrap replicates to use to compute approximate standard errors for the main model estimates. Only available when 'family=cat_dir()' or 'gaussian()'. Ignored if 'algorithm="em"'. For bootstrapping, when there are fewer than 1,000 individuals or 100 or fewer replicates, a Bayesian bootstrap is used instead (i.e., weights are drawn from a Dirichlet(1, ..., 1) distribution). |
warmup |
Number of warmup iterations for Gibbs sampling. Ignored unless 'algorithm="gibbs"'. |
prefix |
If 'r_probs' is a data frame, the columns containing racial probabilities will be selected as those with names starting with 'prefix'. The default will work with the output of [bisg()]. |
ctrl |
A list containing control parameters for the EM algorithm and optimization routines. A list in the proper format can be made using [birdie.ctrl()]. |
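For the 'cat_dir' family, the 'prior' argument expects a list whose 'alpha' entry is a matrix with one row per outcome level and one column per racial group. The following is a minimal base-R sketch of building such a list. The outcome levels, the six group labels, and the flat value of 1 are assumptions chosen for illustration, not package defaults.

```r
# Hypothetical outcome levels and racial-group labels (assumptions for illustration).
y_levels <- c("no", "yes")
r_groups <- c("white", "black", "hisp", "asian", "aian", "other")

# One row per outcome level, one column per racial group.
# The flat value of 1 is only a placeholder, not a recommended prior.
alpha <- matrix(1, nrow = length(y_levels), ncol = length(r_groups),
                dimnames = list(y_levels, r_groups))
prior <- list(alpha = alpha)

# The list would then be passed to the 'prior' argument, e.g.:
# fit = birdie(r_probs, turnout ~ 1, data=pseudo_vf, prior=prior)
str(prior)
```

The matrix dimensions must match the outcome variable and the columns of 'r_probs' actually used, so the labels above should be replaced with those in the data at hand.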
Details
By default, 'birdie()' uses an expectation-maximization (EM) routine to find the maximum *a posteriori* (MAP) estimate for the specified model. Asymptotic variance-covariance matrices for the MAP estimate are available for the Categorical-Dirichlet and Normal linear models via bootstrapping. Full Bayesian inference is supported via Gibbs sampling for the Categorical-Dirichlet and Normal linear models as well.
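The bootstrap options mentioned above differ in how observation weights are drawn. As a rough illustration (not the package's implementation), the sketch below contrasts classical resampling counts with Bayesian bootstrap weights in their standard form, which are a draw from a flat Dirichlet distribution. The sample size and variable names are hypothetical.

```r
set.seed(1)
n <- 5  # toy sample size (hypothetical)

# Classical bootstrap: integer resampling counts that sum to n.
w_classic <- as.vector(rmultinom(1, size = n, prob = rep(1 / n, n)))

# Bayesian bootstrap: a Dirichlet(1, ..., 1) draw, obtained by normalizing
# independent Exponential(1) variables, then rescaled to sum to n.
g <- rexp(n)
w_bayes <- n * g / sum(g)

w_classic
w_bayes
```

The Bayesian weights are strictly positive, so no observation is dropped from a replicate, which is one reason the approach can be more stable in small samples.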
Whatever model or method is used, a finite-population estimate of the outcome-given-race distribution for the entire observed sample is always calculated and stored as '$est' in the returned object, which can be accessed with [coef.birdie()] as well.
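An outcome-given-race estimate is a table with one row per outcome level and one column per racial group, where each column sums to one. As an illustration only, the sketch below builds such a table from toy data by weighting each individual's outcome by their race probabilities. All inputs are hypothetical, and how 'birdie()' computes '$est' internally is not shown here.

```r
# Toy inputs (hypothetical): 4 individuals, 2 racial groups, binary outcome.
p <- matrix(c(0.9, 0.1,
              0.2, 0.8,
              0.6, 0.4,
              0.3, 0.7), ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("pr_a", "pr_b")))
y <- factor(c("yes", "no", "yes", "no"), levels = c("no", "yes"))

# Indicator matrix: one row per individual, one column per outcome level.
y_ind <- outer(as.integer(y), seq_along(levels(y)), "==") * 1

# Probability-weighted outcome counts by race, normalized within each race.
est <- t(y_ind) %*% p
rownames(est) <- levels(y)
est <- sweep(est, 2, colSums(p), "/")
est
```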
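The Categorical-Dirichlet model is specified as follows:
\[
\begin{aligned}
Y_i \mid R_i, X_i, \Theta &\sim \text{Categorical}(\theta_{R_i X_i}) \\
\theta_{rx} &\sim \text{Dirichlet}(\alpha_r),
\end{aligned}
\]
where \(Y\) is the outcome variable, \(R\) is race, \(X\) are covariates (fixed effects), and \(\theta_{rx}\) and \(\alpha_r\) are vectors with length matching the number of levels of the outcome variable. There is one vector \(\theta_{rx}\) for every combination of race and covariates, hence the need for 'formula' to either have no covariates or a fully interacted structure.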
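To make the EM routine concrete for the Categorical-Dirichlet model without covariates, here is a minimal base-R sketch of a single iteration. It is an illustration under stated assumptions, not the package's implementation: the function 'em_step' and all variable names are hypothetical, 'theta' holds the outcome probabilities by race, and 'alpha' holds the Dirichlet hyperparameters, assumed large enough that the mode is well defined.

```r
# One EM iteration for a Categorical-Dirichlet model with no covariates.
# theta, alpha: (outcome levels) x (racial groups) matrices.
# p: n x (racial groups) matrix of prior race probabilities (e.g., BISG).
# y: integer outcome codes in 1..nrow(theta).
em_step <- function(theta, p, y, alpha) {
  # E-step: posterior race probabilities, proportional to prior * likelihood.
  post <- p * theta[y, , drop = FALSE]
  post <- post / rowSums(post)
  # M-step: expected outcome-by-race counts, then the Dirichlet posterior mode.
  y_ind <- outer(y, seq_len(nrow(theta)), "==") * 1
  counts <- t(y_ind) %*% post
  num <- counts + alpha - 1  # assumes every entry stays positive
  list(theta = sweep(num, 2, colSums(num), "/"), post = post)
}

# Toy data (hypothetical): 4 individuals, 2 racial groups, 2 outcome levels.
p <- matrix(c(0.9, 0.1, 0.2, 0.8, 0.6, 0.4, 0.3, 0.7), ncol = 2, byrow = TRUE)
y <- c(2L, 1L, 2L, 1L)
alpha <- matrix(2, 2, 2)

step1 <- em_step(matrix(0.5, 2, 2), p, y, alpha)
fit <- step1
for (i in seq_len(50)) fit <- em_step(fit$theta, p, y, alpha)
round(fit$theta, 3)
```

The 'post' element plays the role of race probabilities updated by conditioning on the outcome, and each column of 'theta' is an outcome distribution for one racial group.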
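The Categorical mixed-effects model is specified as follows:
\[
\begin{aligned}
Y_i \mid R_i, X_i, \Theta &\sim \text{Categorical}(g(\mu_{R_i X_i})) \\
\mu_{rxy} &= W\beta_{ry} + Zu_{ry} \\
u_{ry} \mid \sigma_{ry} &\sim \mathcal{N}(0, \sigma_{ry}^2),
\end{aligned}
\]
where \(\beta_{ry}\) are the fixed effects, \(u_{ry}\) is the random intercept, \(W\) and \(Z\) are the corresponding design matrices, and \(g\) is a softmax link function. Estimates for \(\beta_{ry}\) and \(\sigma_{ry}\) are stored in the '$beta' and '$sigma' elements of the fitted model object.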
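The softmax function used in the mixed-effects model maps a vector of linear predictors, one per outcome level, to a vector of outcome probabilities. A minimal base-R sketch follows; the function name and the example values are hypothetical.

```r
# Softmax: exponentiate and normalize so the entries sum to one.
softmax <- function(mu) {
  z <- exp(mu - max(mu))  # subtracting the max avoids overflow
  z / sum(z)
}

# Hypothetical linear predictors for a two-level outcome.
mu <- c(no = 0, yes = 1.2)
softmax(mu)
```

Only differences between the entries of 'mu' matter, since adding a constant to every entry leaves the output unchanged.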
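The Normal linear model is specified as follows:
\[
Y_i \mid R_i, X_i, \Theta \sim \mathcal{N}(X_i^\top \theta, \sigma^2),
\]
where \(\theta\) is a vector of linear model coefficients. Estimates for \(\theta\) and \(\sigma\) are stored in the '$beta' and '$sigma' elements of the fitted model object.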
More details on the models and their properties may be found in the paper referenced below.
Value
An object of class ['birdie'][birdie::birdie-class], for which many methods are available. The model estimates may be accessed with [coef.birdie()], and updated BISG probabilities (conditioning on the outcome) may be accessed with [fitted.birdie()]. Uncertainty estimates, if available, can be accessed with '$se' and [vcov.birdie()].
References
McCartan, C., Fisher, R., Goldin, J., Ho, D.E., & Imai, K. (2024). Estimating Racial Disparities when Race is Not Observed. Available at https://www.nber.org/papers/w32373.
Examples
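data(pseudo_vf)
# Compute BISG probabilities from last name and ZIP code
r_probs = bisg(~ nm(last_name) + zip(zip), data=pseudo_vf)
# Process zip codes to remove missing values
pseudo_vf$zip = proc_zip(pseudo_vf$zip)
# Categorical-Dirichlet model with no covariates (default family and algorithm)
fit = birdie(r_probs, turnout ~ 1, data=pseudo_vf)
print(fit)
fit$se # uncertainty estimates, if available
# Categorical-Dirichlet model with one covariate, fit by Gibbs sampling
fit = birdie(r_probs, turnout ~ zip, data=pseudo_vf, algorithm="gibbs")
# Categorical mixed-effects model with a random intercept for ZIP code
fit = birdie(r_probs, turnout ~ (1 | zip), data=pseudo_vf,
             family=cat_mixed(), ctrl=birdie.ctrl(abstol=1e-3))
summary(fit)
coef(fit)
fitted(fit)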