gapclosing {gapclosing}    R Documentation

Gap closing estimator

Description

Estimates gap-closing estimands: means and disparities across categories of units that would persist under some counterfactual assignment of a treatment. The user provides a data frame data, a rule counterfactual_assignments for counterfactually assigning treatment, a treatment model and/or an outcome model for learning statistically about the counterfactuals, and the category_name of the variable in data that defines the categories. The returned object summarizes factual and counterfactual means and disparities. Supported estimation algorithms include generalized linear models, ridge regression, generalized additive models, and random forests. Standard errors are available by bootstrapping.

Usage

gapclosing(
  data,
  counterfactual_assignments,
  outcome_formula = NULL,
  treatment_formula = NULL,
  category_name,
  outcome_name = NULL,
  treatment_name = NULL,
  treatment_algorithm = "glm",
  outcome_algorithm = "lm",
  sample_split = "single_sample",
  se = FALSE,
  bootstrap_samples = 1000,
  bootstrap_method = "simple",
  parallel_cores = NULL,
  weight_name = NULL,
  n_folds = 2,
  folds_name = NULL
)

Arguments

data

Data frame containing the observed data

counterfactual_assignments

Numeric scalar or vector of length nrow(data), with every element in the [0,1] interval. If a scalar, the counterfactual probability with which every unit is assigned to treatment condition 1. If a vector, element i is the counterfactual probability with which unit i is assigned to treatment condition 1.
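
As an illustration of the scalar and vector forms, the following base R sketch builds one assignment probability per row. The data frame d, its columns, and the assignment rule are hypothetical and chosen only for this example.

```r
# Hypothetical data: two categories, six units
d <- data.frame(
  category = c("A", "A", "A", "B", "B", "B"),
  treatment = c(1, 0, 1, 0, 0, 1)
)

# Scalar form: every unit is assigned to treatment with probability 0.75
assign_scalar <- 0.75

# Vector form (hypothetical rule): units in category B are assigned to
# treatment with certainty; units in category A keep their factual treatment
assign_vector <- ifelse(d$category == "B", 1, d$treatment)

# Check the documented constraints: length nrow(d), all values in [0, 1]
stopifnot(
  length(assign_vector) == nrow(d),
  all(assign_vector >= 0 & assign_vector <= 1)
)
```

Either assign_scalar or assign_vector could then be passed as counterfactual_assignments.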

outcome_formula

Outcome formula, in the style outcome ~ treatment*covariate. Covariates should include those needed for causal identification of the treatment effect (e.g. as defended in your Directed Acyclic Graph). If outcome_algorithm = "ranger", then the outcome model will be fit separately on the treatment and control groups. Otherwise, the user must specify all interactions in the formula.

treatment_formula

Treatment formula, in the style treatment ~ covariate. Covariates should include those needed for causal identification of the treatment effect (e.g. as defended in your Directed Acyclic Graph).

category_name

Character name of the variable indicating the categories over which the gap is defined. Must be the name of a column in data.

outcome_name

Character name of the outcome variable. Only required when there is no outcome_formula; otherwise extracted automatically. Must be a name of a column in data.

treatment_name

Character name of the treatment variable. Only required when there is no treatment_formula; otherwise extracted automatically. Must be a name of a column in data.

treatment_algorithm

Character name of the algorithm for the treatment model. One of "glm", "ridge", "gam", or "ranger". Defaults to "glm", which is a logit model. Option "ridge" is ridge regression. Option "gam" is a generalized additive model fit (see package mgcv). Option "ranger" is a random forest (see package ranger). If "ranger", this function avoids propensity scores equal to 0 or 1 by bottom- and top-coding predicted values at .001 and .999.
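
To show what bottom- and top-coding at .001 and .999 means in practice, here is a minimal base R sketch. It is not the package's internal code, and the vector propensity is made up for illustration.

```r
# Hypothetical predicted propensity scores, including the extremes 0 and 1
propensity <- c(0, 0.0004, 0.25, 0.9, 0.9995, 1)

# Bottom-code at .001 and top-code at .999
clipped <- pmin(pmax(propensity, 0.001), 0.999)

# Inverse probability weights for treated units are now finite
ipw_treated <- 1 / clipped
```

Without the clipping step, the first element would produce an infinite weight.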

outcome_algorithm

Character name of the algorithm for the outcome model. One of "lm", "ridge", "gam", or "ranger". Defaults to "lm", which is an OLS model. Option "ridge" is ridge regression. Option "gam" is a generalized additive model fit (see package mgcv). Option "ranger" is a random forest (see package ranger).

sample_split

Character for the type of sample splitting to be conducted. One of "single_sample" or "cross_fit". Defaults to "single_sample", in which case data is used for both learning the nuisance functions and aggregating to an estimate. Option "cross_fit" uses cross-fitting to repeatedly use part of the sample to learn the nuisance function and another part to estimate the estimand, averaged over repetitions with these roles swapped.
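
The idea of cross-fitting can be sketched in base R as follows. This is a simplified illustration that uses only an outcome model fit by lm on simulated data; the variable names and the data-generating process are assumptions for the example, and the package's own estimator may differ in its details.

```r
set.seed(1)
n <- 200

# Hypothetical simulated data with a true mean outcome under treatment of 3
d <- data.frame(x = rnorm(n), treatment = rbinom(n, 1, 0.5))
d$outcome <- 1 + 2 * d$treatment + d$x + rnorm(n)
d$fold <- rep(1:2, length.out = n)

# For each fold: learn the outcome model on the other fold(s),
# then predict the outcome under treatment for the held-out fold
fold_estimates <- sapply(sort(unique(d$fold)), function(f) {
  train <- d[d$fold != f, ]
  held_out <- d[d$fold == f, ]
  fit <- lm(outcome ~ treatment * x, data = train)
  mean(predict(fit, newdata = transform(held_out, treatment = 1)))
})

# Average the fold-specific estimates
cross_fit_estimate <- mean(fold_estimates)
```

Each observation is used once for aggregation and never in the model that produces its own prediction.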

se

Logical indicating whether standard errors should be calculated. Default is FALSE. Standard errors assume a simple random sample by default; to stratify by (category x treatment), see the bootstrap_method argument. Because many datasets are not simple random samples, users should carefully consider whether a simple random sample bootstrap will accurately capture uncertainty.

bootstrap_samples

Only used if se = TRUE. Number of bootstrap samples. Default is 1000.

bootstrap_method

Only used if se = TRUE. A character string stating how to conduct bootstrap samples. If "simple", then samples are drawn with replacement from the full data. If "stratified", then the bootstrap is carried out within subpopulations defined by category and treatment. The latter may be useful if the sample contains only a small number of observations in these cells and the user wants to ensure that every (category x treatment) cell appears in every bootstrap sample. With "stratified", inference assumes that in repeated samples from the true population the proportion in each (category x treatment) cell would not change; this may or may not correspond to the true sampling process. Users should be cautious.
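
The difference between the two resampling schemes can be sketched in base R. This is an illustration under assumed data, not the package's internal implementation; the data frame d is hypothetical and deliberately contains a cell with a single observation.

```r
set.seed(1)

# Hypothetical data with a small (category x treatment) cell
d <- data.frame(
  category = rep(c("A", "B"), times = c(8, 4)),
  treatment = c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0)
)

# Simple bootstrap: resample rows with replacement from the full data
simple_rows <- sample(nrow(d), replace = TRUE)

# Stratified bootstrap: resample rows with replacement within each cell
cells <- interaction(d$category, d$treatment, drop = TRUE)
stratified_rows <- unlist(
  lapply(split(seq_len(nrow(d)), cells), function(rows) {
    rows[sample.int(length(rows), replace = TRUE)]
  }),
  use.names = FALSE
)

# Cell sizes are preserved exactly under stratified resampling
table(cells[stratified_rows])
```

Under the simple scheme, the single treated unit in category B can be absent from a given bootstrap sample; under the stratified scheme it always appears.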

parallel_cores

Integer number of cores for parallel processing of the bootstrap. Defaults to sequential processing.

weight_name

Character name of a sampling weight variable, if any, which captures the inverse probability of inclusion in the sample. The default assumes a simple random sample (all weights equal).

n_folds

Only used if sample_split = "cross_fit" and if folds_name is not provided. Integer scalar containing the number of cross-fitting folds. The function assigns observations to folds systematically: the data are sorted by the variable named category_name, then by the treatment variable, then at random, and folds are assigned down the sorted dataset by repeating 1:n_folds. Defaults to 2.
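
The systematic fold assignment described above can be sketched in base R. This is a reading of the description rather than the package's source, and the data frame d is hypothetical.

```r
set.seed(1)
n_folds <- 2

# Hypothetical data: two rows in each (category x treatment) cell
d <- data.frame(
  category = rep(c("A", "B"), each = 4),
  treatment = rep(c(0, 1), times = 4)
)

# Sort by category, then treatment, then a random tiebreaker
ord <- order(d$category, d$treatment, runif(nrow(d)))

# Walk down the sorted rows assigning folds 1, 2, ..., n_folds, 1, 2, ...
d$fold <- NA_integer_
d$fold[ord] <- rep(seq_len(n_folds), length.out = nrow(d))

# Each (category x treatment) cell is spread evenly over the folds
fold_table <- table(interaction(d$category, d$treatment), d$fold)
fold_table
```

Sorting before assigning folds keeps the cells balanced across folds, which matters when some cells are small.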

folds_name

Only used if sample_split = "cross_fit". Character string naming a column of data that contains fold identifiers. This may be preferable to n_folds if the researcher has a reason to assign the folds by some other process, perhaps due to particulars of how the data were generated. If NULL (the default), folds are assigned as described under n_folds.

Value

An object of S3 class gapclosing, which supports summary(), print(), and plot() functions. The returned object can be coerced to a data frame of estimates with as.data.frame().
The object returned by a call to gapclosing contains several elements.

References

Lundberg I (2021). "The gap-closing estimand: A causal approach to study interventions that close disparities across social categories." Sociological Methods and Research. Available at https://osf.io/gx4y3/.

Friedman J, Hastie T, Tibshirani R (2010). "Regularization Paths for Generalized Linear Models via Coordinate Descent." Journal of Statistical Software, 33(1), 1–22. https://www.jstatsoft.org/htaccess.php?volume=33&type=i&issue=01.

Wood S (2017). Generalized Additive Models: An Introduction with R, 2nd edition. Chapman and Hall/CRC.

Wright MN, Ziegler A (2017). "ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R." Journal of Statistical Software, 77(1), 1–17. doi: 10.18637/jss.v077.i01.

Examples

# Simulate example data
simulated_data <- generate_simulated_data(n = 100)

# Fit by outcome modeling
# You can add standard errors with se = TRUE
estimate <- gapclosing(
  data = simulated_data,
  outcome_formula = outcome ~ treatment * category + confounder,
  treatment_name = "treatment",
  category_name = "category",
  counterfactual_assignments = 1
)
summary(estimate)

# Fit by treatment modeling
# You can add standard errors with se = TRUE
estimate <- gapclosing(
  data = simulated_data,
  treatment_formula = treatment ~ category + confounder,
  outcome_name = "outcome",
  category_name = "category",
  counterfactual_assignments = 1
)
summary(estimate)

# Fit by doubly-robust estimation
# You can add standard errors with se = TRUE
estimate <- gapclosing(
  data = simulated_data,
  outcome_formula = outcome ~ treatment * category + confounder,
  treatment_formula = treatment ~ category + confounder,
  category_name = "category",
  counterfactual_assignments = 1
)
summary(estimate)

# Fit by doubly-robust cross-fitting estimation with random forests
# You can add standard errors with se = TRUE
estimate <- gapclosing(
  data = simulated_data,
  outcome_formula = outcome ~ category + confounder,
  treatment_formula = treatment ~ category + confounder,
  category_name = "category",
  counterfactual_assignments = 1,
  outcome_algorithm = "ranger",
  treatment_algorithm = "ranger",
  sample_split = "cross_fit"
)
summary(estimate)
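
The examples above note that standard errors can be added with se = TRUE. A sketch of such a call, using only arguments documented on this page, might look like the following. The small bootstrap_samples value is chosen only to keep the run short, and the requireNamespace guard is an addition for this sketch so that the code does nothing when the package is not installed.

```r
if (requireNamespace("gapclosing", quietly = TRUE)) {
  library(gapclosing)
  set.seed(1)
  simulated_data <- generate_simulated_data(n = 100)

  # Doubly-robust estimation with bootstrapped standard errors,
  # resampling within (category x treatment) cells
  estimate_with_se <- gapclosing(
    data = simulated_data,
    outcome_formula = outcome ~ treatment * category + confounder,
    treatment_formula = treatment ~ category + confounder,
    category_name = "category",
    counterfactual_assignments = 1,
    se = TRUE,
    bootstrap_samples = 50,
    bootstrap_method = "stratified"
  )
  summary(estimate_with_se)
}
```

For real analyses, a larger number of bootstrap samples, such as the default of 1000, is likely more appropriate.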

[Package gapclosing version 1.0.2 Index]