R: Oaxaca-Blinder decomposition

ob_decompose {ddecompose}

R Documentation

Oaxaca-Blinder decomposition

Description

ob_decompose implements the Oaxaca-Blinder decomposition that divides differences in the mean outcome between two groups into one part explained by different covariate means (composition effect) and into another part due to differences in linear regression coefficients linking covariates to the outcome variable (structure effect).

The function allows for 'doubly robust' decompositions where the sample of one group is reweighted such that it matches the covariates distribution of the other group before the regression coefficients are estimated.

For distributional statistics beyond the mean, the function performs the RIF regression decomposition proposed by Firpo, Fortin, and Lemieux (2018).

Usage

ob_decompose(
  formula,
  data,
  group,
  weights = NULL,
  reweighting = FALSE,
  normalize_factors = FALSE,
  reference_0 = TRUE,
  subtract_1_from_0 = FALSE,
  reweighting_method = "logit",
  trimming = FALSE,
  trimming_threshold = NULL,
  rifreg_statistic = NULL,
  rifreg_probs = c(1:9)/10,
  custom_rif_function = NULL,
  na.action = na.omit,
  bootstrap = FALSE,
  bootstrap_iterations = 100,
  bootstrap_robust = FALSE,
  cluster = NULL,
  cores = 1,
  vcov = stats::vcov,
  ...
)

Arguments

`formula`	a `formula` object with an outcome variable Y on the left-hand side and the covariates X on the right-hand side. If `reweighting = TRUE`, the same covariates are used to estimate the conditional probabilities for the reweighting factor. A different model for estimating the conditional probabilities can be defined after a `\|` operator on the right-hand side.
`data`	a data frame containing the variables in the model.
`group`	name of the a binary variable (numeric or factor) identifying the two groups that will be compared. The group identified by the lower ranked value in 'group' (i.e., 0 in the case of a dummy variable or the first level of factor variable) is defined as group 0.
`weights`	numeric vector of non-negative observation weights, hence of same length as `dep_var`. The default (`NULL`) is equivalent to `weights = rep(1, length(dep_var))`. If no weights are used, make sure you do not define this parameter (e.g. with `weights = NULL`).
`reweighting`	boolean: if 'TRUE', then the decomposition is performed with with respect to reweighted reference group yielding either a 'doubly robust' Oaxaca-Blinder decomposition or a reweighted RIF decomposition.
`normalize_factors`	boolean: If 'TRUE', then factor variables are normalized as proposed by Gardeazabal/Ugidos (2004) and results are not dependent on the factor's reference group. Per default (`normalize_factors = FALSE`) and factors are not normalized.
`reference_0`	boolean: if 'TRUE' (default), then the group 0 – i.e., the group identified by the lower ranked value in 'group' – will be defined as reference group. The reference group will be reweighted to match the covariates distribution of the counterfactual sample. By default, the composition effect is computed as `(X1 - X0) * b0` and the structure effect as `X1 * (b1 - b0)`. Putting `reference_0 = FALSE` changes the reference structure. Hence, the composition effect is computed as `(X1 - X0) * b1` and the structure effect as `X0 * (b1 - b0)`.
`subtract_1_from_0`	boolean: By default ('FALSE'), X0 is subtracted from X1 and beta0 from beta1 (X1b1 - X0b0) to compute the overall difference. Setting 'subtract_1_from_0' to 'TRUE' merely changes the sign of the decomposition results. This means the composition effect is computed as `(X0 - X1) * b1` and the structure effect as `X0 * (b0 - b1)`.
`reweighting_method`	specifies the method fit and predict conditional probabilities used to derive the reweighting factor. Currently, `"logit"`, `"fastglm"`, and `"random_forest"` are available.
`trimming`	boolean: If `TRUE`, observations with dominant reweighting factor values are trimmend according to rule of Huber, Lechner, and Wunsch (2013). Per default, trimming is set to `FALSE`.
`trimming_threshold`	numeric: threshold defining the maximal accepted relative weight of the reweighting factor value (i.e., inverse probability weight) of a single observation. If `NULL`, the threshold is set to `sqrt(N)/N`, where `N` is the number of observations in the reference group.
`rifreg_statistic`	string containing the distributional statistic for which to compute the RIF. If 'NULL' (default), no RIF regression decomposition is computed. If an available statistic is selected, 'ob_decompose' estimates a RIF regression decomposition. The 'rifreg_statistic' can be one of "quantiles", "mean", "variance", "gini", "interquantile_range", "interquantile_ratio", or "custom". If "custom" is selected, a `custom_rif_function` needs to be provided.
`rifreg_probs`	a vector of length 1 or more with probabilities of quantiles. Each quantile is indicated with a value between 0 and 1. Default is `c(1:9)/10`. If `statistic = "quantiles"`, a single RIF regression for every quantile in `probs` is estimated. An interquantile ratio (range) is defined by the ratio (difference) between the `max(probs)`-quantile and the `min(probs)`-quantile.
`custom_rif_function`	the RIF function to compute the RIF of the custom distributional statistic. Default is NULL. Only needs to be provided if `statistic = "custom"`. Every custom_rif_function needs the parameters `dep_var`, `weights` and `probs`. If they are not needed, they must be set to NULL in the function definition (e.g. `probs = NULL`). A custom function must return a data frame containing at least a "rif" and "weights" column. See `examples` for further details.
`na.action`	generic function that defines how NAs in the data should be handled. Default is `na.omit`, leading to exclusion of observations that contain one or more missings. See na.action for further details.
`bootstrap`	boolean: If 'FALSE' (default), then no bootstrapped standard errors are calculated and, in the case of a standard Oaxaca-Blinder decomposition, analytical standard errors are estimated (assuming independence between groups).
`bootstrap_iterations`	positive integer indicating the number of bootstrap iterations to execute. Only required if `bootstrap = TRUE`.
`bootstrap_robust`	boolean: if 'FALSE' (default), then bootstrapped standard errors are estimated as the standard deviations of the bootstrapp estimates. Otherwise, the function uses the bootstrap interquartile range rescaled by the interquantile range of the standard distribution to estimate standard errors.
`cluster`	numeric vector of same length as `dep_var` indicating the clustering of observations. If `cluster = NULL` (default), no clustering is a assumend and bootstrap procedure resamples individual observations. Otherwise bootstrap procedure resamples clusters.
`cores`	positive integer indicating the number of cores to use when computing bootstrap standard errors. Only required if `bootstrap = TRUE`.
`vcov`	function estimating covariance matrix of regression coefficients if standard errors are not bootstrapped (i.e., `bootstrap = FALSE`). By default, vcov is used assuming homoscedastic errors.
`...`	additional parameters passed to the custom_rif_function. Apart from dep_var, weights and probs they must have a different name than the the ones in rifreg. For instance, if you want to pass a parameter statistic to the custom_rif_function, name it custom_statistic. Additional parameters can also be passed to the density function used to estimate the RIF of quantiles.

Details

ob_decompose() contains for four different decomposition methods of observed group differences.

1. The original Oaxaca-Blinder decomposition (default) 2. A 'doubly robust Oaxaca-Blinder decomposition (reweighting=TRUE) 3. A RIF Regression decomposition. (e.g., rifreg_statistic="quantiles") 4. A reweighted RIF regression decomposition. (reweighting=TRUE and rifreg_statistic="quantiles")

The doubly robust OB decomposition is a robust and path independent alternative for detailed decompositions at the mean. is to combine reweighting with the linear Oaxaca-Blinder method (see Fortin et al., 2011: 48-51). This approach has the valuable side effect of accounting for potential errors introduced by an incomplete inverse probability weighting and the linear model specification, respectively.

A path independent method that goes beyond the mean is the RIF decomposition of Firpo, Fortin, and Lemieux (2018). The approach approximates the expected value of the 'recentered influence function' (RIF) of the distributional statistic (e.g., quantile, variance, or Gini coefficient) of an outcome variable conditional on covariates with linear regressions. RIF regression coefficients can be consistent estimates of the marginal effect of a small change in the expected value of a covariate to the distributional statistics of an outcome variable (see documentation of the companion package rifreg). Thus, they can be used to decompose between-group difference in distributional statistics. Firpo et al. (2018) combine the RIF regressions again with the reweighting estimator to avoid specification errors.

Value

an object of class ob_decompose containing a data.frame with the decomposition results for the quantiles and for the other distributional statistics, respectively, a data.frame with the estimated reweighting factor for every observation, a data.frame with sample quantiles of the reweighting factors and a list with standard errors for the decomposition terms, the quantiles of the reweighting factor, the bootstrapped Kolmogorov-Smirnov distribution to construct uniform confidence bands for quantiles, as well as a list with the normalized differences between the covariate means of the comparison group and the reweighted reference group.

A list object of class 'ob_decompose' containing the following components:

- 'ob_decompose': A list containing the decomposition results, covariance matrix, model fits and more detailed result information. - 'group_variable_name': A string indicating the name of the group variable. - 'group_variable_levels': A string indicating the levels of the group variable. - 'reference_group': A string indicating the which level of the group variable was used as reference group. - 'reweighting_estimates': A list containing the reweighting estimates if reweighting=TRUE, else (NA) - 'input_parameters': A list of input parameters used for the estimation.

References

Firpo, Sergio, Nicole M. Fortin, and Thomas Lemieux. 2018. "Decomposing Wage Distributions Using Recentered Influence Function Regressions." Econometrics, 6(2):28.

Fortin, Nicole, Thomas Lemieux, and Sergio Firpo. 2011. "Decomposition methods in economics." In Orley Ashenfelter and David Card, eds., Handbook of labor economics. Vol. 4. Elsevier, 1-102.

Gardeazabal, Javier, and Arantza Ugidos. 2004. "More on identification in detailed wage decompositions." Review of Economics and Statistics, 86(4): 1034-1036.

Examples


## Oaxaca-Blinder decomposition of gender wage gap
## with NLYS79 data like in Fortin, Lemieux, & Firpo (2011: 41)

data("nlys00")

mod1 <- log(wage) ~ age + central_city + msa + region + black +
  hispanic + education + afqt + family_responsibility + years_worked_civilian +
  years_worked_military + part_time + industry

# Using female coefficients (reference_0 = TRUE) to estimate counterfactual mean
decompose_female_as_reference <- ob_decompose(
  formula = mod1,
  data = nlys00,
  group = female,
  reference_0 = TRUE
)
decompose_female_as_reference

# Using male coefficients (reference_0 = FALSE)
decompose_male_as_reference <- ob_decompose(
  formula = mod1,
  data = nlys00,
  group = female,
  reference_0 = FALSE
)
decompose_male_as_reference

# Replicate first and third column in Table 3 in Fortin, Lemieux, & Firpo (2011: 41)
# Define aggregation of decomposition terms
custom_aggregation <- list(
  `Age, race, region, etc.` = c(
    "age",
    "blackyes",
    "hispanicyes",
    "regionNorth-central",
    "regionSouth",
    "regionWest",
    "central_cityyes",
    "msayes"
  ),
  `Education` = c(
    "education<10 yrs",
    "educationHS grad (diploma)",
    "educationHS grad (GED)",
    "educationSome college",
    "educationBA or equiv. degree",
    "educationMA or equiv. degree",
    "educationPh.D or prof. degree"
  ),
  `AFTQ` = "afqt",
  `L.T. withdrawal due to family` = "family_responsibility",
  `Life-time work experience` = c(
    "years_worked_civilian",
    "years_worked_military",
    "part_time"
  ),
  `Industrial sectors` = c(
    "industryManufacturing",
    "industryEducation, Health, Public Admin.",
    "industryOther services"
  )
)

# First column
summary(decompose_male_as_reference, custom_aggregation = custom_aggregation)

# Third column
summary(decompose_female_as_reference, custom_aggregation = custom_aggregation)

## Compare bootstrapped standard errors...
decompose_female_as_reference_bs <- ob_decompose(
  formula = mod1,
  data = nlys00,
  group = female,
  bootstrap = TRUE,
  bootstrap_iterations = 100
)
summary(decompose_female_as_reference_bs, custom_aggregation = custom_aggregation)

# ... to analytical standard errors (assuming independence between groups and
# homoscedasticity)
decompose_female_as_reference <- ob_decompose(
  formula = mod1,
  data = nlys00,
  group = female,
  reference_0 = TRUE
)
summary(decompose_female_as_reference, custom_aggregation = custom_aggregation)

# Return standard errors for all detailed terms
summary(decompose_female_as_reference, aggregate_factors = FALSE)


## 'Doubly robust' Oaxaca-Blinder decomposition of gender wage gap
mod2 <- log(wage) ~ age + central_city + msa + region + black +
  hispanic + education + afqt + family_responsibility + years_worked_civilian +
  years_worked_military + part_time + industry | age + (central_city + msa) * region + (black +
  hispanic) * (education + afqt) + family_responsibility * (years_worked_civilian +
  years_worked_military) + part_time * industry
decompose_male_as_reference_robust <- ob_decompose(
  formula = mod2,
  data = nlys00,
  group = female,
  reference_0 = FALSE,
  reweighting = TRUE
)

# ... using random forests instead of logit to estimate weights
decompose_male_as_reference_robust_rf <- ob_decompose(
  formula = mod1,
  data = nlys00,
  group = female,
  reference_0 = FALSE,
  reweighting = TRUE,
  method = "random_forest"
)


# Reweighted RIF Regression Decomposition
data("men8305")

model_rifreg <- log(wage) ~  union + education + experience |
  union * (education + experience) + education * experience

# Variance
variance_decomposition <- ob_decompose(
  formula = model_rifreg,
  data = men8305,
  group = year,
  reweighting = TRUE,
  rifreg_statistic = "variance"
)

# Deciles
deciles_decomposition <- ob_decompose(
  formula = model_rifreg,
  data = men8305,
  group = year,
  reweighting = TRUE,
  rifreg_statistic = "quantiles",
  rifreg_probs = c(1:9) / 10
)

# plot(deciles_decomposition)

# RIF regression decomposition with custom function

# custom function
custom_variance_function <- function(dep_var, weights, probs = NULL) {
  weighted_mean <- weighted.mean(x = dep_var, w = weights)
  rif <- (dep_var - weighted_mean)^2
  rif <- data.frame(rif, weights)
  names(rif) <- c("rif_variance", "weights")
  return(rif)
}

 custom_decomposition <-
   ob_decompose(
    formula = model_rifreg,
    data = men8305,
    group = year,
    reweighting = TRUE,
    rifreg_statistic = "custom",
    custom_rif_function = custom_variance_function
  )

[Package ddecompose version 1.0.0 Index]