birdie {birdie} | R Documentation |
Fit BIRDiE Models
Description
Fits one of three possible Bayesian Instrumental Regression for Disparity Estimation (BIRDiE) models to BISG probabilities and covariates. The simplest Categorical-Dirichlet model ([cat_dir()]) is appropriate when there are no covariates or when all covariates are discrete and fully interacted with another. The more general Categorical mixed-effects model ([cat_mixed()]) is a supports any number of fixed effects and up to one random intercept. For continuous outcomes a Normal linear model is available ([gaussian()]).
Usage
birdie(
r_probs,
formula,
data,
family = cat_dir(),
prior = NULL,
weights = NULL,
algorithm = c("em", "gibbs", "em_boot"),
iter = 400,
warmup = 50,
prefix = "pr_",
ctrl = birdie.ctrl()
)
Arguments
r_probs |
A data frame or matrix of BISG probabilities, with one row per individual. The output of [bisg()] can be used directly here. |
formula |
A two-sided formula object describing the model structure. The left-hand side is the outcome variable, which must be discrete. A single random intercept term, denoted with a vertical bar ('"(1 | <term>)"'), is supported on the right-hand side. |
data |
An optional data frame containing the variables named in 'formula'. |
family |
A description of the complete-data model type to fit. Options are: - [cat_dir()]: Categorical-Dirichlet model. All covariates must be fully interacted. - [cat_mixed()]: Categorical mixed-effects model. Up to one random effect is supported. - [gaussian()]: Linear model. See the Details section below for more information on the various models. |
prior |
A list with entries specifying the model prior. - For the 'cat_dir' model, the only entry is 'alpha', which should be a matrix
of Dirichlet hyperparameters. The matrix should have one row for every
level of the outcome variable and one column for every racial group. The
default prior (used when 'prior=NULL') is an empirical Bayes prior equal to
the weighted-mean estimate of the outcome-race table. A fully
noninformative prior with all entries set to The prior is stored after model fitting in the '$prior' element of the fitted model object. |
weights |
An optional numeric vector specifying likelihood weights. |
algorithm |
The inference algorithm to use. One of 3 options: - '"em"': An expectation-maximization algorithm which will perform inference for the maximum a posteriori (MAP) parameter values. Computationally efficient and supported by all the model families. No uncertainty quantification. - '"gibbs"': A Gibbs sampler for performing full Bayesian inference. Generally more computationally demanding than the EM algorithm, but provides uncertainty quantification. Currently supported for 'cat_dir()' and 'gaussian()' model families. Computation-reliability tradeoff can be controlled with 'iter' argument. - '"em_boot"': Bootstrapped version of EM algorithm. Number of bootstrap replicates controlled by 'iter' parameter. Provides approximate uncertainty quantification. Currently supported for 'cat_dir()' and 'gaussian()' model families. |
iter |
The number of post-warmup Gibbs samples, or the number of bootstrap replicates to use to compute approximate standard errors for the main model estimates. Only available when 'family=cat_dir()' or 'gaussian()'. Ignored if 'algorithm="em"'. For bootstrapping, when there are fewer than 1,000 individuals or 100 or
fewer replicates, a Bayesian bootstrap is used instead (i.e., weights are
drawn from a |
warmup |
Number of warmup iterations for Gibbs sampling. Ignored unless 'algorithm="gibbs"'. |
prefix |
If 'r_probs' is a data frame, the columns containing racial probabilities will be selected as those with names starting with 'prefix'. The default will work with the output of [bisg()]. |
ctrl |
A list containing control parameters for the EM algorithm and optimization routines. A list in the proper format can be made using [birdie.ctrl()]. |
Details
By default, 'birdie()' uses an expectation-maximization (EM) routine to find the maximum *a posteriori* (MAP) estimate for the specified model. Asymptotic variance-covariance matrices for the MAP estimate are available for the Categorical-Dirichlet and Normal linear models via bootstrapping. Full Bayesian inference is supported via Gibbs sampling for the Categorical-Dirichlet and Normal linear models as well.
Whatever model or method is used, a finite-population estimate of the outcome-given-race distribution for the entire observed sample is always calculated and stored as '$est' in the returned object, which can be accessed with [coef.birdie()] as well.
The Categorical-Dirichlet model is specified as follows:
Y_i \mid R_i, X_i, \Theta \sim \text{Categorical}(\theta_{R_iX_i}) \\
\theta_{rx} \sim \text{Dirichlet}(\alpha_r),
where Y
is the outcome variable, R
is race, X
are
covariates (fixed effects), and \theta_{rx}
and \alpha_r
are
vectors with length matching the number of levels of the outcome variable.
There is one vector \theta_{rx}
for every combination of race and
covariates, hence the need for 'formula' to either have no covariates or a
fully interacted structure.
The Categorical mixed-effects model is specified as follows:
Y_i \mid R_i, X_i, \Theta \sim \text{Categorical}(g^{-1}(\mu_{R_iX_i})) \\
\mu_{rxy} = W\beta_{ry} + Zu_{ry} \\
u_{r} \mid \vec\sigma_{r}, L_r \sim \mathcal{N}(0,
\text{diag}(\vec\sigma_{r})C_r\text{diag}(\vec\sigma_{r})) \\
\beta_{ry} \sim \mathcal{N}(0, s^2_{r\beta}) \\
\sigma_{ry} \sim \text{Inv-Gamma}(4, 3s_{r\sigma}) \\
C_r \sim \text{LKJ}(2),
where \beta_{ry}
are the fixed effects, u_{ry}
is the random
intercept, and g
is a softmax link function.
Estimates for \beta_{ry}
and \sigma_{ry}
are stored in the
'$beta' and '$sigma' elements of the fitted model object.
The Normal linear model is specified as follows:
Y_i \mid R_i, \vec X_i, \Theta \sim \mathcal{N}(\vec X_i^\top\vec\theta, \sigma^2) \\
\sigma^2 \sim \text{Inv-Gamma}(n_\sigma/2, l_\sigma^2 n_\sigma/2) \\
\beta_{\text{intercept}} \sim \mathcal{N}(0, s^2_\text{int}) \\
\beta_k \sim \mathcal{N}(0, s^2_\beta), \\
where \vec\theta
is a vector of linear model coefficients.
Estimates for \theta
and \sigma
are stored in the
'$beta' and '$sigma' elements of the fitted model object.
More details on the models and their properties may be found in the paper referenced below.
Value
An object of class ['birdie'][birdie::birdie-class], for which many methods are available. The model estimates may be accessed with [coef.birdie()], and updated BISG probabilities (conditioning on the outcome) may be accessed with [fitted.birdie()]. Uncertainty estimates, if available, can be accessed with '$se' and [vcov.birdie()].
References
McCartan, C., Fisher, R., Goldin, J., Ho, D.E., & Imai, K. (2024). Estimating Racial Disparities when Race is Not Observed. Available at https://www.nber.org/papers/w32373.
Examples
data(pseudo_vf)
r_probs = bisg(~ nm(last_name) + zip(zip), data=pseudo_vf)
# Process zip codes to remove missing values
pseudo_vf$zip = proc_zip(pseudo_vf$zip)
fit = birdie(r_probs, turnout ~ 1, data=pseudo_vf)
print(fit)
fit$se # uncertainty quantification
fit = birdie(r_probs, turnout ~ zip, data=pseudo_vf, algorithm="gibbs")
fit = birdie(r_probs, turnout ~ (1 | zip), data=pseudo_vf,
family=cat_mixed(), ctrl=birdie.ctrl(abstol=1e-3))
summary(fit)
coef(fit)
fitted(fit)