R: Causal mediation analysis with double machine learning

medDML {causalweight}

R Documentation

Causal mediation analysis with double machine learning

Description

Causal mediation analysis (evaluation of natural direct and indirect effects) for a binary treatment and one or several mediators using double machine learning to control for confounders based on (doubly robust) efficient score functions for potential outcomes.

Usage

medDML(
  y,
  d,
  m,
  x,
  k = 3,
  trim = 0.05,
  order = 1,
  multmed = TRUE,
  fewsplits = FALSE,
  normalized = TRUE
)

Arguments

`y`	Dependent variable, must not contain missings.
`d`	Treatment, must be binary (either 1 or 0), must not contain missings.
`m`	Mediator, must not contain missings. May be a scalar or a vector of binary, categorical, or continuous variables if `multmed` is `TRUE`. Must be a binary scalar if `multmed` is `FALSE`.
`x`	(Potential) pre-treatment confounders of the treatment, mediator, and/or outcome, must not contain missings.
`k`	Number of folds in k-fold cross-fitting if `multmed` is `FALSE`. `k`-1 folds are used for estimating the model parameters of the treatment, mediator, and outcome equations, and the remaining fold is used for predicting the efficient score functions. The roles of the folds are then swapped. Default for `k` is 3. If `multmed` is `TRUE`, 3-fold cross-fitting is used irrespective of the number provided in `k` (i.e. `k` is ignored if `multmed` is `TRUE`).
`trim`	Trimming rule for discarding observations with extreme conditional treatment or mediator probabilities (or products thereof). Observations with (products of) conditional probabilities that are smaller than `trim` in any denominator of the potential outcomes are dropped. Default is 0.05.
`order`	If set to an integer larger than 1, then polynomials of that order and interactions (using the power series) rather than the original control variables are used in the estimation of any conditional probability or conditional mean outcome. Polynomials/interactions are created using the `Generate.Powers` command of the `LARF` package.
`multmed`	If set to `TRUE`, a representation of direct and indirect effects that avoids conditional mediator densities/probabilities is used, see Farbmacher, Huber, Laffers, Langen, and Spindler (2019). This method can incorporate multiple and/or continuous mediators. If `multmed` is `FALSE`, the representation of Tchetgen Tchetgen and Shpitser (2012) is used, which involves mediator densities. In this case, the mediator must be a binary scalar. Default of `multmed` is `TRUE`.
`fewsplits`	If set to `TRUE`, the same training data are used for estimating nested models of nuisance parameters, i.e. `E[Y|D=d,M,X]` and `E[E[Y|D=d,M,X]|D=1-d,X]`. If `fewsplits` is `FALSE`, the training data are split for the sequential estimation of nested models `E[Y|D=d,M,X]` and `E[E[Y|D=d,M,X]|D=1-d,X]`. This parameter is only relevant if `multmed` is `TRUE`. Default of `fewsplits` is `FALSE`.
`normalized`	If set to `TRUE`, then the inverse probability-based weights are normalized such that they add up to 1 within treatment groups. Default is `TRUE`.
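
To make the roles of `trim` and `normalized` concrete, the following base R sketch applies a trimming rule and weight normalization to a plain treatment propensity score. It is a simplified illustration, not the internal code of `medDML`: the helper `trim_and_weight` is hypothetical, the propensity score is assumed known rather than estimated, and only the denominators `p` and `1 - p` are checked, whereas `medDML` also considers mediator probabilities and products of probabilities.

```r
# Hypothetical helper for illustration; not a function of causalweight.
trim_and_weight <- function(d, pscore, trim = 0.05, normalized = TRUE) {
  # Drop observations whose probability in either denominator is below `trim`
  keep <- pscore >= trim & (1 - pscore) >= trim
  d <- d[keep]
  p <- pscore[keep]
  w1 <- d / p              # inverse probability weights, treated
  w0 <- (1 - d) / (1 - p)  # inverse probability weights, controls
  if (normalized) {
    # Weights add up to 1 within each treatment group
    w1 <- w1 / sum(w1)
    w0 <- w0 / sum(w0)
  } else {
    w1 <- w1 / length(d)
    w0 <- w0 / length(d)
  }
  list(keep = keep, w1 = w1, w0 = w0, ntrimmed = sum(!keep))
}

set.seed(42)
n <- 1000
x <- rnorm(n)
pscore <- pnorm(2 * x)        # assumed known; deliberately extreme in the tails
d <- rbinom(n, 1, pscore)
y <- x + 0.5 * d + rnorm(n)

w <- trim_and_weight(d, pscore, trim = 0.05)
ate_ipw <- sum(w$w1 * y[w$keep]) - sum(w$w0 * y[w$keep])
w$ntrimmed                    # number of discarded observations
```

A larger `trim` discards more observations and therefore changes the population for which effects are estimated, so the number of trimmed observations is worth inspecting.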

Details

The function estimates causal mechanisms (natural direct and indirect effects) of a treatment under selection on observables, assuming that all confounders of the binary treatment and the mediator, the treatment and the outcome, or the mediator and the outcome are observed and not affected by the treatment. Estimation is based on the (doubly robust) efficient score functions for potential outcomes, see Tchetgen Tchetgen and Shpitser (2012) and Farbmacher, Huber, Laffers, Langen, and Spindler (2019), as well as on double machine learning with cross-fitting, see Chernozhukov et al. (2018). To this end, one part of the data is used for estimating the model parameters of the treatment, mediator, and outcome equations by post-lasso regression, using the `rlasso` and `rlassologit` functions (for conditional means and probabilities, respectively) of the `hdm` package with default settings. The other part of the data is used for predicting the efficient score functions. The roles of the data parts are then swapped, and the direct and indirect effects are estimated by averaging the predicted efficient score functions over the total sample. Standard errors are based on asymptotic approximations using the estimated variance of the (estimated) efficient score functions.
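
The cross-fitting logic described above can be sketched in base R. The sketch below is a deliberately reduced illustration under stated assumptions, not the implementation of `medDML`: it targets only the total effect (ATE) with the standard doubly robust score, uses `glm` and `lm` in place of post-lasso regression, applies no trimming, and all variable names are chosen for this example.

```r
set.seed(1)
n <- 2000
x <- matrix(rnorm(n * 2), ncol = 2)
d <- as.numeric(x %*% c(0.25, 0.25) + rnorm(n) > 0)        # treatment
y <- as.numeric(x %*% c(0.25, 0.25) + 0.5 * d + rnorm(n))  # outcome, true ATE = 0.5
dat <- data.frame(y = y, d = d, x1 = x[, 1], x2 = x[, 2])

k <- 3
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
score <- numeric(n)

for (j in seq_len(k)) {
  train <- dat[fold != j, ]  # k-1 folds: estimate nuisance models
  test  <- dat[fold == j, ]  # remaining fold: predict the score

  ps_fit  <- glm(d ~ x1 + x2, family = binomial(), data = train)
  mu1_fit <- lm(y ~ x1 + x2, data = train[train$d == 1, ])
  mu0_fit <- lm(y ~ x1 + x2, data = train[train$d == 0, ])

  p   <- predict(ps_fit, newdata = test, type = "response")
  mu1 <- predict(mu1_fit, newdata = test)
  mu0 <- predict(mu0_fit, newdata = test)

  # Doubly robust score for the ATE, evaluated out of sample
  score[fold == j] <- mu1 - mu0 +
    test$d * (test$y - mu1) / p -
    (1 - test$d) * (test$y - mu0) / (1 - p)
}

ate  <- mean(score)                # average of predicted scores over the total sample
se   <- sd(score) / sqrt(n)        # asymptotic standard error from the score variance
pval <- 2 * pnorm(-abs(ate / se))  # two-sided p-value, normal approximation
c(effect = ate, se = se, p.value = pval)
```

Because every observation's score is predicted from models fitted on the other folds, overfitting in the nuisance models does not feed directly into the effect estimate, which is the motivation for cross-fitting.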

Value

A `medDML` object contains two components, `results` and `ntrimmed`:

`results`: a 3x6 matrix containing the effect estimates in the first row ("effects"), standard errors in the second row ("se"), and p-values in the third row ("p-value"). The first column provides the total effect, namely the average treatment effect (ATE). The second and third columns provide the direct effects under treatment and control, respectively ("dir.treat", "dir.control"). The fourth and fifth columns provide the indirect effects under treatment and control, respectively ("indir.treat", "indir.control"). The sixth column provides the estimated mean under non-treatment ("Y(0,M(0))").

`ntrimmed`: number of discarded (trimmed) observations due to extreme conditional probabilities.
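
Since the results matrix stores estimates and standard errors in its first two rows, confidence intervals based on the normal approximation can be derived from it. The sketch below uses a hypothetical helper `confint_results` and a mock matrix with made-up numbers that only mimics the documented layout. The name of the first column (`"total"`) and the two-sided normal formula for the p-values are assumptions.

```r
# Hypothetical helper, not part of causalweight: normal-approximation
# confidence intervals from row 1 (estimates) and row 2 (standard errors).
confint_results <- function(results, level = 0.95) {
  z   <- qnorm(1 - (1 - level) / 2)
  est <- results[1, ]
  se  <- results[2, ]
  rbind(lower = est - z * se, upper = est + z * se)
}

# Mock matrix with made-up numbers, only to show the layout.
mock <- matrix(
  c(0.70, 0.50, 0.50, 0.20, 0.20, 0.10,   # effects (illustrative values)
    0.05, 0.04, 0.04, 0.02, 0.02, 0.03,   # se (illustrative values)
    0,    0,    0,    0,    0,    0),     # placeholder, filled in below
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("effects", "se", "p-value"),
    c("total", "dir.treat", "dir.control",
      "indir.treat", "indir.control", "Y(0,M(0))")
  )
)
# Assumed form of the p-values: two-sided test based on the normal distribution
mock["p-value", ] <- 2 * pnorm(-abs(mock["effects", ] / mock["se", ]))

ci <- confint_results(mock)
round(ci, 3)
```

With a real fit, the same helper could be applied to the `results` component of the returned object.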

References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., Robins, J. (2018): "Double/debiased machine learning for treatment and structural parameters", The Econometrics Journal, 21, C1-C68.

Farbmacher, H., Huber, M., Laffers, L., Langen, H., and Spindler, M. (2019): "Causal mediation analysis with double machine learning", working paper, University of Fribourg.

Tchetgen Tchetgen, E. J., and Shpitser, I. (2012): "Semiparametric theory for causal mediation analysis: efficiency bounds, multiple robustness, and sensitivity analysis", The Annals of Statistics, 40, 1816-1845.

Tibshirani, R. (1996): "Regression shrinkage and selection via the lasso", Journal of the Royal Statistical Society: Series B, 58, 267-288.

Examples

# A small example with simulated data (10000 observations)
## Not run: 
library(causalweight)
set.seed(1)                                   # for reproducibility
n <- 10000                                    # sample size
p <- 100                                      # number of covariates
s <- 2                                        # number of covariates that are confounders
x <- matrix(rnorm(n * p), ncol = p)           # covariate matrix
beta <- c(rep(0.25, s), rep(0, p - s))        # coefficients determining degree of confounding
d <- (x %*% beta + rnorm(n) > 0) * 1          # treatment equation
m <- (x %*% beta + 0.5 * d + rnorm(n) > 0) * 1  # mediator equation
y <- x %*% beta + 0.5 * d + m + rnorm(n)      # outcome equation
# The true direct effects are equal to 0.5; the true indirect effects are
# approximately 0.18, i.e. pnorm(0.5 / sqrt(1.125)) - 0.5
output <- medDML(y = y, d = d, m = m, x = x)
round(output$results, 3)                      # effects, standard errors, p-values
output$ntrimmed                               # number of trimmed observations
## End(Not run)
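
The true effects implied by the simulated design can be checked without running `medDML`. Because the mediator enters the outcome equation with coefficient 1 and there is no treatment-mediator interaction, the direct effect is 0.5 and the indirect effect equals the average change in the mediator caused by the treatment. The following sketch computes that quantity analytically and by Monte Carlo. The variable names are chosen for this illustration only.

```r
set.seed(123)
n <- 1e6

# x %*% beta has variance s * 0.25^2 = 0.125 when s = 2 confounders
# carry a coefficient of 0.25 each
xb <- rnorm(n, sd = sqrt(2 * 0.25^2))
u  <- rnorm(n)                         # mediator error term

m1 <- as.numeric(xb + 0.5 + u > 0)     # potential mediator under treatment
m0 <- as.numeric(xb + u > 0)           # potential mediator under control

indirect_mc <- mean(m1 - m0)           # Monte Carlo value

# Analytic value: xb + u is normal with mean 0 and variance 1.125
indirect_analytic <- pnorm(0.5 / sqrt(1 + 2 * 0.25^2)) - 0.5

round(c(monte_carlo = indirect_mc, analytic = indirect_analytic), 3)
```

Both values should agree up to simulation noise, and the total effect is the sum of the direct effect and this indirect effect.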

[Package causalweight version 1.1.1 Index]