dfl_decompose {ddecompose} | R Documentation |
DFL reweighting decomposition
Description
dfl_decompose divides between-group differences in distributional statistics of an outcome variable into a structure effect and a composition effect. Following DiNardo, Fortin, and Lemieux (1996), the procedure reweights the sample distribution of a reference group such that the group's covariates distribution matches the covariates distribution of a comparison group.
The function derives counterfactual distributions with inverse probability weighting. Reweighting factors are estimated by modelling the probability of belonging to the comparison group conditional on covariates.
The function allows detailed decompositions of the composition effect by sequentially reweighting (conditional) covariate distributions. Standard errors can be bootstrapped.
Usage
dfl_decompose(
  formula,
  data,
  weights,
  group,
  na.action = na.exclude,
  reference_0 = TRUE,
  subtract_1_from_0 = FALSE,
  right_to_left = TRUE,
  method = "logit",
  estimate_statistics = TRUE,
  statistics = c("quantiles", "mean", "variance", "gini", "iq_range_p90_p10",
    "iq_range_p90_p50", "iq_range_p50_p10"),
  probs = c(1:9)/10,
  custom_statistic_function = NULL,
  trimming = FALSE,
  trimming_threshold = NULL,
  return_model = TRUE,
  estimate_normalized_difference = TRUE,
  bootstrap = FALSE,
  bootstrap_iterations = 100,
  bootstrap_robust = FALSE,
  cores = 1,
  ...
)
Arguments
formula |
a |
data |
a |
weights |
name of the observation weights variable or vector of observation weights. |
group |
name of a binary variable (numeric or factor) identifying the
two groups for which the differences are to be decomposed. The group
identified by the lower ranked value in |
na.action |
a function to filter missing data (default |
reference_0 |
boolean: if |
subtract_1_from_0 |
boolean: By default ('FALSE'), the distributional statistic of group 0 is subtracted from the one of group 1 to compute the overall difference. Setting 'subtract_1_from_0' to 'TRUE' merely changes the sign of the decomposition results. |
right_to_left |
determines the direction of a sequential decomposition.
If |
method |
specifies the method to fit and predict conditional probabilities
used to derive the reweighting factor. At the moment, |
estimate_statistics |
boolean: if |
statistics |
a character vector that defines the distributional statistics
for which the decomposition is performed. Per default,
|
probs |
a vector of length 1 or more with the probabilities of the quantiles
to be estimated with default |
custom_statistic_function |
a function estimating a custom distributional statistic
that will be decomposed ( |
trimming |
boolean: If |
trimming_threshold |
numeric: threshold defining the maximal accepted
relative weight of the reweighting factor value (i.e., inverse probability weight)
of a single observation. If |
return_model |
boolean: If |
estimate_normalized_difference |
boolean: If |
bootstrap |
boolean: If |
bootstrap_iterations |
positive integer with default |
bootstrap_robust |
boolean: if |
cores |
positive integer with default |
... |
other parameters passed to the function estimating the conditional probabilities. |
Details
The observed difference to be decomposed equals the difference between the values of the distributional statistic of group 1 and group 0, respectively:
\Delta_O = \nu_1 - \nu_0,
where \nu_g = \nu(F_g) denotes the statistic of the outcome distribution F_g of group g. Group 0 is identified by the lower ranked value of the group variable.
If reference_0=TRUE, then group 0 is the reference group and its observations are reweighted such that they match the covariates distribution of group 1, the comparison group. The counterfactual combines the covariates distribution F_1(x) of group 1 with the conditional outcome distribution F_0(y|x) of group 0 and is derived by reweighting group 0:
F_C(y) = \int F_0(y|x) dF_1(x) = \int F_0(y|x) \Psi(x) dF_0(x),
where \Psi(x) is the reweighting factor, i.e., the inverse probability weight derived from the probability of belonging to the comparison group conditional on covariates x. The distributional statistic of the counterfactual distribution, \nu_C = \nu(F_C), allows the observed difference to be decomposed into a (wage) structure effect (\Delta_S = \nu_1 - \nu_C) and a composition effect (\Delta_C = \nu_C - \nu_0).
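For orientation, the reweighting factor can be written with Bayes' rule in terms of group membership probabilities. The expression below is the standard form from DiNardo, Fortin, and Lemieux (1996), stated for reference_0=TRUE; it is a sketch of the underlying identity, not a description of the package's internal code. The second line restates that the two effects add up to the observed difference.

```latex
\Psi(x) = \frac{dF_1(x)}{dF_0(x)}
        = \frac{P(g = 1 \mid x)}{P(g = 0 \mid x)} \cdot \frac{P(g = 0)}{P(g = 1)}

\Delta_O = \nu_1 - \nu_0 = (\nu_1 - \nu_C) + (\nu_C - \nu_0) = \Delta_S + \Delta_C
```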
If reference_0=FALSE, then the counterfactual is derived by combining the covariates distribution of group 0 with the conditional outcome distribution of group 1 and, thus, reweighting group 1:
F_C(y) = \int F_1(y|x) dF_0(x) = \int F_1(y|x) \Psi(x) dF_1(x).
The composition effect becomes \Delta_C = \nu_1 - \nu_C and the structure effect \Delta_S = \nu_C - \nu_0, respectively.
The covariates are defined in formula. The reweighting factor is estimated in the pooled sample with observations from both groups. method = "logit" uses a logit model to fit the conditional probabilities. method = "fastglm" also fits a logit model but with a faster algorithm from fastglm. method = "random_forest" uses the ranger implementation of the random forests classifier.
The counterfactual statistics are then estimated with the observed data of the reference group and the fitted reweighting factors.
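The estimation steps described above can be sketched in base R for the mean as the statistic of interest. This is a minimal illustration under stated assumptions, not the package's implementation: the data are simulated, all object names (d, psi, nu_C, ...) are hypothetical, observation weights are ignored, and reference_0=TRUE is assumed.

```r
# Minimal sketch of an aggregate DFL decomposition of the mean (base R only).
# Simulated data; all object names are hypothetical.
set.seed(1)
n <- 2000
g <- rep(c(0, 1), each = n / 2)            # group 0 = reference, group 1 = comparison
x <- rnorm(n, mean = 0.5 * g)              # covariate distribution differs by group
y <- 1 + x + 0.3 * g + rnorm(n, sd = 0.5)  # group 1 also has a different structure
d <- data.frame(y, x, g)

# Logit model for P(g = 1 | x), fitted in the pooled sample
fit <- glm(g ~ x, family = binomial(link = "logit"), data = d)
p1_x <- predict(fit, type = "response")    # fitted P(g = 1 | x)
p1 <- mean(d$g)                            # unconditional P(g = 1)

# Reweighting factor: conditional odds of belonging to group 1,
# rescaled by the inverse unconditional odds
psi <- (p1_x / (1 - p1_x)) * ((1 - p1) / p1)

# Observed and counterfactual statistics
nu_1 <- mean(d$y[d$g == 1])
nu_0 <- mean(d$y[d$g == 0])
nu_C <- weighted.mean(d$y[d$g == 0], w = psi[d$g == 0])  # reweighted group 0

delta_O <- nu_1 - nu_0   # observed difference
delta_S <- nu_1 - nu_C   # structure effect
delta_C <- nu_C - nu_0   # composition effect
round(c(observed = delta_O, structure = delta_S, composition = delta_C), 3)
```

In this simulation the covariate shift contributes 0.5 and the group-specific intercept 0.3 to the difference in means, so the estimated composition and structure effects should lie close to these values.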
formula allows specifying interaction terms in the conditional probability models. If you are interested in an aggregate decomposition, then all covariates have to be entered at once, e.g., Y ~ X + Z.
The procedure allows for sequential decomposition of the composition effect. In this case, more than one reweighting factor is estimated, each based on a different set of covariates.
If you are interested in a sequential decomposition, the decomposition sequence has to be distinguished by the | operator in the formula object. For instance, Y ~ X | Z would decompose the aggregate composition effect into the contribution of covariate(s) X and the one of covariate(s) Z, respectively.
In this two-fold sequential decomposition with reference_0=TRUE and right_to_left=TRUE, we have the detailed composition effects
\Delta_{C_X} = \nu_C - \nu_{CX},
and
\Delta_{C_Z} = \nu_{CX} - \nu_0,
which sum up to the aggregate composition effect \Delta_C = \nu_C - \nu_0.
\nu_C is defined as above. It captures the contribution of all covariates (i.e., X and Z). In contrast, \nu_{CX} corresponds to the statistic of the counterfactual distribution isolating the contribution of covariate(s) X in contrast to the one of covariate(s) Z.
If right_to_left=TRUE, then the counterfactual is defined as
F_{CX}(y) = \iint F_0(y|x,z) dF_0(x|z) dF_1(z),
where F_0(x|z) is the conditional distribution of X given Z of group 0 and F_1(z) the distribution of Z of group 1. If right_to_left=FALSE, we have
F_{CX}(y) = \iint F_0(y|x,z) dF_1(x|z) dF_0(z).
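Both counterfactuals can be expressed as reweighted versions of the group 0 distribution. The notation \Psi_Z and \Psi_{X,Z} is introduced here for illustration only; the identities follow from the definitions above, but how the package combines the fitted models internally is not stated in this documentation.

```latex
\Psi_Z(z) = \frac{dF_1(z)}{dF_0(z)}, \qquad
\Psi_{X,Z}(x,z) = \frac{dF_1(x,z)}{dF_0(x,z)}, \qquad
\frac{dF_1(x \mid z)}{dF_0(x \mid z)} = \frac{\Psi_{X,Z}(x,z)}{\Psi_Z(z)}

% right_to_left = TRUE
F_{CX}(y) = \iint F_0(y \mid x,z) \, \Psi_Z(z) \, dF_0(x,z)

% right_to_left = FALSE
F_{CX}(y) = \iint F_0(y \mid x,z) \, \frac{\Psi_{X,Z}(x,z)}{\Psi_Z(z)} \, dF_0(x,z)
```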
Note that it is possible to specify the detailed models in every part of formula. This is useful if you want to estimate in every step a fully saturated model, e.g., Y ~ X * Z | Z. If not further specified, the variables are additively included in the model used to derive the aggregate reweighting factor.
The detailed decomposition terms are path-dependent. The results depend on the sequence in which the covariates enter the decomposition (e.g., Y ~ X | Z yields different detailed decomposition terms than Y ~ Z | X). Even for the same sequence, the results differ depending on the 'direction' of the decomposition. In the example above using right_to_left=TRUE, the contribution of Z is evaluated using the conditional distribution of X given Z from group 0. If we use right_to_left=FALSE instead, the same contribution is evaluated using the conditional distribution from group 1.
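The dependence on the direction can be made concrete with a small base R simulation. The sketch below assumes reference_0=TRUE and the mean as statistic, builds the two counterfactuals F_{CX} from logit-based reweighting factors, and compares the contribution attributed to Z in both directions. The helper odds_factor, the simulated data, and all object names are hypothetical; this is not the package's internal code.

```r
# Sketch of path dependence in a two-fold sequential decomposition (base R only).
# Simulated data; odds_factor() and all object names are hypothetical.
set.seed(42)
n <- 4000
g <- rep(c(0, 1), each = n / 2)
z <- rnorm(n, mean = 0.5 * g)                    # Z differs between groups
x <- rbinom(n, 1, plogis(-0.5 + z - 0.8 * g))    # X given Z differs between groups
y <- 1 + z + 0.5 * x + 0.5 * x * z + rnorm(n, sd = 0.5)
d <- data.frame(y, x, z, g)

# Reweighting factor from a pooled logit: conditional odds times inverse marginal odds
odds_factor <- function(formula, data) {
  fit <- glm(formula, family = binomial(link = "logit"), data = data)
  p <- predict(fit, type = "response")
  p1 <- mean(data$g)
  (p / (1 - p)) * ((1 - p1) / p1)
}
psi_xz <- odds_factor(g ~ x * z, d)  # approximates dF_1(x,z) / dF_0(x,z)
psi_z  <- odds_factor(g ~ z, d)      # approximates dF_1(z) / dF_0(z)

ref <- d$g == 0
nu <- function(w) weighted.mean(d$y[ref], w = w[ref])  # statistic of reweighted group 0

nu_0      <- nu(rep(1, n))
nu_C      <- nu(psi_xz)              # all covariates as in group 1
nu_CX_rtl <- nu(psi_z)               # F_0(y|x,z) dF_0(x|z) dF_1(z)
nu_CX_ltr <- nu(psi_xz / psi_z)      # F_0(y|x,z) dF_1(x|z) dF_0(z)

# Contribution of Z, holding X given Z at group 0 (rtl) or at group 1 (ltr)
contrib_z_rtl <- nu_CX_rtl - nu_0
contrib_z_ltr <- nu_C - nu_CX_ltr
# Contribution of X given Z is the remainder of the aggregate composition effect
contrib_x_rtl <- nu_C - nu_CX_rtl
contrib_x_ltr <- nu_CX_ltr - nu_0

round(rbind(
  right_to_left = c(X = contrib_x_rtl, Z = contrib_z_rtl),
  left_to_right = c(X = contrib_x_ltr, Z = contrib_z_ltr)
), 3)
```

In both directions the two contributions add up to the same aggregate composition effect nu_C - nu_0; only the split between X and Z changes.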
Per default, the distributional statistics for which the between-group differences are decomposed are quantiles, the mean, the variance, the Gini coefficient and the interquantile ranges between the 9th and the 1st decile, the 9th decile and the median, and the median and the first decile, respectively. The interquantile ratios between the same quantiles are implemented as well.
The quantiles can be specified by probs, which sets the corresponding probabilities of the quantiles of interest. For other distributional statistics, please use custom_statistic_function.
The function bootstraps standard errors and derives a bootstrapped Kolmogorov-Smirnov distribution to construct uniform confidence bands. The Kolmogorov-Smirnov distribution is estimated as in Chen et al. (2017).
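The general idea behind uniform bands can be sketched as follows: for every bootstrap replication, take the maximum absolute studentized deviation across all quantiles, and use a quantile of these maxima as the critical value. The code is a generic base R illustration in the spirit of Chen et al. (2017), applied to plain sample quantiles; the data and object names are hypothetical, and the package may differ in details such as the scale estimate used for studentization.

```r
# Sketch of pointwise versus uniform bootstrap bands for a set of quantiles.
# Simulated data; all object names are hypothetical.
set.seed(7)
probs <- 1:9 / 10
sample_y <- rnorm(500)
estimate <- quantile(sample_y, probs, names = FALSE)

# Bootstrap the quantiles: one row per replication, one column per quantile
B <- 200
boot <- t(replicate(
  B,
  quantile(sample(sample_y, replace = TRUE), probs, names = FALSE)
))

# Bootstrapped standard errors per quantile
se <- apply(boot, 2, sd)

# Studentized absolute deviations and their maximum over quantiles
t_stats <- sweep(abs(sweep(boot, 2, estimate, "-")), 2, se, "/")
ks <- apply(t_stats, 1, max)  # Kolmogorov-Smirnov-type statistic per replication

# Critical values: uniform from the bootstrapped maxima, pointwise from the normal
crit_uniform <- quantile(ks, 0.95, names = FALSE)
crit_pointwise <- qnorm(0.975)

bands <- data.frame(
  prob = probs,
  estimate = estimate,
  lower_pointwise = estimate - crit_pointwise * se,
  upper_pointwise = estimate + crit_pointwise * se,
  lower_uniform = estimate - crit_uniform * se,
  upper_uniform = estimate + crit_uniform * se
)
bands
```

The uniform critical value is at least as large as the corresponding bootstrap quantile for any single quantile, which is why uniform bands are wider than bands built one quantile at a time.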
Value
an object of class dfl_decompose containing a data.frame with the
decomposition results for the quantiles and for the other distributional
statistics, respectively, a data.frame with the estimated reweighting factor
for every observation, a data.frame with sample quantiles of the reweighting
factors and a list with standard errors for the decomposition terms, the
quantiles of the reweighting factor, the bootstrapped
Kolmogorov-Smirnov distribution to construct uniform confidence bands for
quantiles, as well as a list with the normalized differences between the
covariate means of the comparison group and the reweighted reference group.
References
Chen, Mingli, Victor Chernozhukov, Iván Fernández-Val, and Blaise Melly. 2017. "Counterfactual: An R Package for Counterfactual Analysis." The R Journal, 9(1), 370-384.
DiNardo, John, Nicole M. Fortin, and Thomas Lemieux. 1996. "Labor Market Institutions and the Distribution of Wages, 1973-1992: A Semiparametric Approach." Econometrica, 64(5), 1001-1044.
Firpo, Sergio P., Nicole M. Fortin, and Thomas Lemieux. 2018. "Decomposing Wage Distributions Using Recentered Influence Function Regressions." Econometrics, 6(2), 28.
Fortin, Nicole M., Thomas Lemieux, and Sergio Firpo. 2011. "Decomposition methods in economics." In Orley Ashenfelter and David Card, eds., Handbook of Labor Economics. Vol. 4. Elsevier, 1-102.
Firpo, Sergio P., and Cristine Pinto. 2016. "Identification and Estimation of Distributional Impacts of Interventions Using Changes in Inequality Measures." Journal of Applied Econometrics, 31(3), 457-486.
Huber, Martin, Michael Lechner, and Conny Wunsch. 2013. "The performance of estimators based on the propensity score." Journal of Econometrics, 175(1), 1-21.
Examples
## Example from handbook chapter of Fortin, Lemieux, and Firpo (2011: 67)
## with a sample of the original data
data("men8305")
flf_model <- log(wage) ~ union * (education + experience) + education * experience
# Reweighting sample from 1983-85
flf_male_inequality <- dfl_decompose(flf_model,
  data = men8305,
  weights = weights,
  group = year
)
# Summarize results
summary(flf_male_inequality)
# Plot decomposition of quantile differences
plot(flf_male_inequality)
# Use alternative reference group (i.e., reweight sample from 2003-05)
flf_male_inequality_reference_0305 <- dfl_decompose(flf_model,
  data = men8305,
  weights = weights,
  group = year,
  reference_0 = FALSE
)
summary(flf_male_inequality_reference_0305)
# Bootstrap standard errors (using smaller sample for the sake of illustration)
set.seed(123)
flf_male_inequality_boot <- dfl_decompose(flf_model,
  data = men8305[1:1000, ],
  weights = weights,
  group = year,
  bootstrap = TRUE,
  bootstrap_iterations = 100,
  cores = 1
)
# Get standard errors and confidence intervals
summary(flf_male_inequality_boot)
# Plot quantile differences with pointwise confidence intervals
plot(flf_male_inequality_boot)
# Plot quantile differences with uniform confidence intervals
plot(flf_male_inequality_boot, uniform_bands = TRUE)
## Sequential decomposition
# Here we distinguish the contribution of education and experience
# from the contribution of unionization conditional on education and experience.
model_sequential <- log(wage) ~ union * (education + experience) +
  education * experience |
  education * experience
# First variant:
# Contribution of union is evaluated using composition of
# education and experience from 2003-2005 (group 1)
male_inequality_sequential <- dfl_decompose(model_sequential,
  data = men8305,
  weights = weights,
  group = year
)
# Summarize results
summary(male_inequality_sequential)
# Second variant:
# Contribution of union is evaluated using composition of
# education and experience from 1983-1985 (group 0)
male_inequality_sequential_2 <- dfl_decompose(model_sequential,
  data = men8305,
  weights = weights,
  group = year,
  right_to_left = FALSE
)
# Summarize results
summary(male_inequality_sequential_2)
# The decomposition effects associated with (conditional) unionization for deciles
cbind(
  male_inequality_sequential$decomposition_quantiles$prob,
  male_inequality_sequential$decomposition_quantiles$`Comp. eff. X1|X2`,
  male_inequality_sequential_2$decomposition_quantiles$`Comp. eff. X1|X2`
)
## Trim observations with weak common support
## (i.e. observations with relative factor weights > sqrt(N)/N)
set.seed(123)
data_weak_common_support <- data.frame(
  d = factor(c(
    c("A", "A", rep("B", 98)),
    c(rep("A", 90), rep("B", 10))
  )),
  group = rep(c(0, 1), each = 100)
)
data_weak_common_support$y <- ifelse(data_weak_common_support$d == "A", 1, 2) +
  data_weak_common_support$group +
  rnorm(200, 0, 0.5)
decompose_results_trimmed <- dfl_decompose(y ~ d,
  data_weak_common_support,
  group = group,
  trimming = TRUE
)
identical(
  decompose_results_trimmed$trimmed_observations,
  which(data_weak_common_support$d == "A")
)
## Pass a custom statistic function to decompose income share of top 10%
top_share <- function(dep_var,
                      weights,
                      top_percent = 0.1) {
  threshold <- Hmisc::wtd.quantile(dep_var, weights = weights, probs = 1 - top_percent)
  share <- sum(weights[which(dep_var > threshold)] *
    dep_var[which(dep_var > threshold)]) /
    sum(weights * dep_var)
  return(share)
}
flf_male_inequality_custom_stat <- dfl_decompose(flf_model,
  data = men8305,
  weights = weights,
  group = year,
  custom_statistic_function = top_share
)
summary(flf_male_inequality_custom_stat)