R: Generic Machine Learning Inference

GenericML {GenericML}

R Documentation

Generic Machine Learning Inference

Description

Performs generic machine learning inference on heterogeneous treatment effects as in Chernozhukov, Demirer, Duflo and Fernández-Val (2020) with user-specified machine learning methods. Intended for randomized experiments.

Usage

GenericML(
  Z,
  D,
  Y,
  learners_GenericML,
  learner_propensity_score = "constant",
  num_splits = 100,
  Z_CLAN = NULL,
  HT = FALSE,
  quantile_cutoffs = c(0.25, 0.5, 0.75),
  X1_BLP = setup_X1(),
  X1_GATES = setup_X1(),
  diff_GATES = setup_diff(),
  diff_CLAN = setup_diff(),
  vcov_BLP = setup_vcov(),
  vcov_GATES = setup_vcov(),
  equal_variances_CLAN = FALSE,
  prop_aux = 0.5,
  stratify = setup_stratify(),
  significance_level = 0.05,
  min_variation = 1e-05,
  parallel = FALSE,
  num_cores = parallel::detectCores(),
  seed = NULL,
  store_learners = FALSE,
  store_splits = TRUE
)

Arguments

`Z`	A numeric design matrix that holds the covariates in its columns.
`D`	A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero assignment to the control group.
`Y`	A numeric vector containing the response variable.
`learners_GenericML`	A character vector specifying the machine learners to be used for estimating the baseline conditional average (BCA) and conditional average treatment effect (CATE). Each element is either `'lasso'`, `'random_forest'`, `'tree'`, or a custom learner specified with `mlr3` syntax. In the latter case, do not indicate in the `mlr3` specification whether the learner is a regression learner or a classification learner. Example: `'mlr3::lrn("ranger", num.trees = 100)'` for a random forest learner with 100 trees. Note that the specification is passed as a string and that it omits the `classif.` and `regr.` keywords. See https://mlr3learners.mlr-org.com for a list of `mlr3` learners.
`learner_propensity_score`	The estimator of the propensity scores. Either a numeric vector (which is then taken as estimates of the propensity scores) or a string specifying the estimator. In the latter case, the string must be equal to `'constant'` (estimates the propensity scores by `mean(D)`), `'lasso'`, `'random_forest'`, `'tree'`, or a learner in `mlr3` syntax. When using `mlr3` syntax, do not indicate whether the learner is a regression learner or a classification learner. Example: `'mlr3::lrn("ranger", num.trees = 100)'` for a random forest learner with 100 trees. Note that the specification is passed as a string and that it omits the `classif.` and `regr.` keywords. See https://mlr3learners.mlr-org.com for a list of `mlr3` learners.
`num_splits`	Number of sample splits. Default is 100. Must be larger than one. If you want to run `GenericML` on a single split, please use `GenericML_single()`.
`Z_CLAN`	A numeric matrix holding variables on which classification analysis (CLAN) shall be performed. CLAN will be performed on each column of the matrix. If `NULL` (default), then `Z_CLAN = Z`, i.e. CLAN is performed for all variables in `Z`.
`HT`	Logical. If `TRUE`, a Horvitz-Thompson (HT) transformation is applied in the BLP and GATES regressions. Default is `FALSE`.
`quantile_cutoffs`	The cutoff points of the quantiles that shall be used for GATES grouping. Default is `c(0.25, 0.5, 0.75)`, which splits the sample into four quartile groups.
`X1_BLP`	Specifies the design matrix `X_1` in the BLP regression. Must be an object of class `"setup_X1"`. See the documentation of `setup_X1()` for details.
`X1_GATES`	Same as `X1_BLP`, just for the GATES regression.
`diff_GATES`	Specifies the generic targets of GATES. Must be an object of class `"setup_diff"`. See the documentation of `setup_diff()` for details.
`diff_CLAN`	Same as `diff_GATES`, just for the CLAN generic targets.
`vcov_BLP`	Specifies the covariance matrix estimator in the BLP regression. Must be an object of class `"setup_vcov"`. See the documentation of `setup_vcov()` for details.
`vcov_GATES`	Same as `vcov_BLP`, just for the GATES regression.
`equal_variances_CLAN`	Logical. Controls how the variance of the difference of two CLAN generic targets (i.e. the variance of the difference of two means) is estimated. If `TRUE`, all within-group variances of the CLAN groups are assumed to be equal (homoskedasticity) and the pooled variance is used. If `FALSE` (heteroskedasticity), the variance of Welch's t-test is used, which is robust to unequal within-group variances. Default is `FALSE`.
`prop_aux`	Proportion of samples that shall be in the auxiliary set in case of random sample splitting. Default is 0.5. The number of samples in the auxiliary set will be equal to `floor(prop_aux * length(Y))`. If the data set is large, you can save computing time by choosing `prop_aux` to be smaller than 0.5. In case of stratified sampling (controlled through the argument `stratify` via `setup_stratify()`), `prop_aux` does not have an effect, and the number of samples in the auxiliary set is specified via `setup_stratify()`.
`stratify`	A list that specifies whether or not stratified sample splitting shall be performed. It is recommended to use the returned object of `setup_stratify()` as this list. See the documentation of `setup_stratify()` for details.
`significance_level`	Significance level for VEIN. Default is 0.05.
`min_variation`	Specifies a threshold for the minimum variation of the BCA/CATE predictions. If the variation of a BCA/CATE prediction falls below this threshold, random noise with distribution `N(0, var(Y)/20)` is added to it. Default is `1e-05`.
`parallel`	Logical. If `TRUE`, parallel computing will be used. Default is `FALSE`. On Unix systems, this will be done via forking (the child processes share the memory of the parent process). On non-Unix systems, this will be done through parallel socket clusters.
`num_cores`	Number of cores to be used in parallelization (if applicable). Default is the number of cores of the user's machine.
`seed`	Random seed. Default is `NULL` for no random seeding.
`store_learners`	Logical. If `TRUE`, all intermediate results of the learners will be stored. That is, for each learner and each split, all BCA and CATE predictions as well as all BLP, GATES, CLAN, and Λ estimates will be stored. Default is `FALSE`.
`store_splits`	Logical. If `TRUE` (default), the sample splits will be stored.
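To make `quantile_cutoffs` and `equal_variances_CLAN` more concrete, the following base R sketch groups simulated CATE proxy predictions by quantile cutoffs and then computes both variance estimates for a difference of two group means. It is an illustration under stated assumptions, not the package's internal code: the data are simulated, the names `proxy_cate`, `v1`, `groups`, `var_welch`, and `var_pooled` are made up for this example, and the use of `cut()` with right-closed intervals is an assumption about how ties at the cutoffs might be handled.

```r
set.seed(1)
n          <- 80
proxy_cate <- rnorm(n)   # hypothetical CATE proxy predictions
v1         <- runif(n)   # hypothetical CLAN variable

## k cutoffs define k + 1 groups; non-default cutoffs give unequal group sizes
quantile_cutoffs <- c(0.2, 0.5, 0.75)
breaks <- c(-Inf, quantile(proxy_cate, probs = quantile_cutoffs), Inf)
groups <- cut(proxy_cate, breaks = breaks,
              labels = paste0("G", seq_len(length(quantile_cutoffs) + 1)))
table(groups)

## CLAN-style comparison of the highest (G4) and lowest (G1) group
x_most  <- v1[groups == "G4"]
x_least <- v1[groups == "G1"]
n_most  <- length(x_most)
n_least <- length(x_least)

## equal_variances_CLAN = FALSE: Welch variance of the difference in means
var_welch <- var(x_most) / n_most + var(x_least) / n_least

## equal_variances_CLAN = TRUE: pooled variance of the difference in means
var_within <- ((n_most - 1) * var(x_most) + (n_least - 1) * var(x_least)) /
  (n_most + n_least - 2)
var_pooled <- var_within * (1 / n_most + 1 / n_least)

c(welch = var_welch, pooled = var_pooled)
```

When the two groups have the same size, the two estimates coincide, which is why the sketch uses cutoffs other than the default quartiles.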

Details

The specifications "lasso", "random_forest", and "tree" in learners_GenericML and learner_propensity_score correspond to the following mlr3 specifications (we omit the keywords classif. and regr.):

"lasso": A cross-validated Lasso estimator, which corresponds to 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)'.
"random_forest": A random forest with 500 trees, which corresponds to 'mlr3::lrn("ranger", num.trees = 500)'.
"tree": A tree learner, which corresponds to 'mlr3::lrn("rpart")'.

Warning: GenericML() can be quite memory-intensive, in particular when the data set is large. To alleviate memory usage, consider setting store_learners = FALSE, choosing a low number of cores via num_cores (at the expense of longer computing time), setting prop_aux to a value smaller than the default of 0.5, or using GenericML_combine().
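The correspondence between the shorthand learners and their mlr3 specifications stated in this section can be written as a small lookup table. This is a sketch only: the strings are copied from the text above, while the helper `expand_learners()` is hypothetical, is not part of GenericML, and merely shows how shorthands and custom mlr3 strings can be mixed in one character vector.

```r
## shorthand learners and the mlr3 specifications stated in this section
shorthand <- c(
  lasso         = 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)',
  random_forest = 'mlr3::lrn("ranger", num.trees = 500)',
  tree          = 'mlr3::lrn("rpart")'
)

## hypothetical helper: expand shorthands, pass custom mlr3 strings through
expand_learners <- function(learners) {
  is_short <- learners %in% names(shorthand)
  learners[is_short] <- shorthand[learners[is_short]]
  unname(learners)
}

expand_learners(c("lasso", "mlr3::lrn('ranger', num.trees = 10)"))
```

In practice there is no need to expand the shorthands yourself, since the shorthand strings can be passed directly to learners_GenericML and learner_propensity_score.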

Value

An object of class "GenericML". On this object, we recommend using the accessor functions get_BLP(), get_GATES(), and get_CLAN() to extract the results of the analyses of BLP, GATES, and CLAN, respectively. An object of class "GenericML" contains the following components:

VEIN: A list containing two sub-lists called best_learners and all_learners, respectively. Each of these two sub-lists contains the inferential VEIN results on the generic targets of the BLP, GATES, and CLAN analyses. all_learners does this for all learners specified in the argument learners_GenericML, best_learners only for the corresponding best learners. Which learner is best for which analysis is assessed by the Λ criteria discussed in Sections 5.2 and 5.3 of the paper.
best: A list containing information on the evaluation of which learner is the best for which analysis. Contains four components. The first three contain the name of the best learner for BLP, GATES, and CLAN, respectively. The fourth component, overview, contains the two Λ criteria used to determine the best learners (discussed in Sections 5.2 and 5.3 of the paper).
propensity_scores: The propensity score estimates as well as the "mlr3" objects used to estimate them (if mlr3 was used for estimation).
GenericML_single: Only nonempty if store_learners = TRUE. Contains all intermediate results of each learner for each split. That is, for a given learner (first level of the list) and split (second level), objects of classes "BLP", "GATES", "CLAN", "proxy_BCA", "proxy_CATE" as well as the Λ criteria ("best") are listed, which were computed with the given learner and split.
splits: Only nonempty if store_splits = TRUE. Contains a character matrix of dimension length(Y) by num_splits. Contains the group membership (main or auxiliary) of each observation (rows) in each split (columns). "M" denotes the main set, "A" the auxiliary set.
generic_targets: A list of generic target estimates for each learner. More specifically, each component is a list of the generic target estimates pertaining to the BLP, GATES, and CLAN analyses. Each of those lists contains a three-dimensional array containing the generic targets of a single learner for all sample splits (except CLAN where there is one more layer of lists).
arguments: A list of arguments used in the function call.
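The layout of the splits component can be pictured with a small base R mock-up. The sketch below builds a matrix of the documented shape by hand instead of calling GenericML(), so the object name `splits_mock`, the toy dimensions, and the way the random assignment is drawn are assumptions made for illustration. Only the shape (length(Y) by num_splits), the "M"/"A" coding, and the auxiliary-set size floor(prop_aux * length(Y)) are taken from the documentation.

```r
set.seed(1)
n          <- 10     # stands in for length(Y)
num_splits <- 3
prop_aux   <- 0.5
n_aux      <- floor(prop_aux * n)   # size of the auxiliary set

## one column per split: "A" = auxiliary set, "M" = main set
splits_mock <- replicate(num_splits, {
  membership <- rep("M", n)
  membership[sample(n, n_aux)] <- "A"
  membership
})

dim(splits_mock)                             # n rows, num_splits columns
colSums(splits_mock == "A")                  # auxiliary-set size in each split
main_idx <- which(splits_mock[, 1] == "M")   # main-set rows of the first split
```

With a stored splits matrix of this form, the main or auxiliary observations of any split can be recovered by indexing the corresponding column.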

Note

In an earlier development version, Lucas Kitzmueller alerted us to several minor bugs and proposed fixes. Many thanks to him!

References

Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.

Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. (2019). “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, 4(44), 1903. doi: 10.21105/joss.01903.

Examples

if (require("glmnet") && require("ranger")) {

## generate data
set.seed(1)
n  <- 150                                  # number of observations
p  <- 5                                    # number of covariates
D  <- rbinom(n, 1, 0.5)                    # random treatment assignment
Z  <- matrix(runif(n*p), n, p)             # design matrix
Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
Y1 <- 2 + Y0                               # potential outcome under treatment
Y  <- ifelse(D == 1, Y1, Y0)               # observed outcome

## column names of Z
colnames(Z) <- paste0("V", 1:p)

## specify learners
learners <- c("lasso", "mlr3::lrn('ranger', num.trees = 10)")

## glmnet v4.1.3 isn't supported on Solaris, so skip Lasso in this case
if(Sys.info()["sysname"] == "SunOS") learners <- learners[-1]

## specify quantile cutoffs (the 4 quartile groups here)
quantile_cutoffs <- c(0.25, 0.5, 0.75)

## specify the differenced generic targets of GATES and CLAN
# use G4-G1, G4-G2, G4-G3 as differenced generic targets in GATES
diff_GATES <- setup_diff(subtract_from = "most",
                         subtracted = c(1,2,3))
# use G1-G3, G1-G2 as differenced generic targets in CLAN
diff_CLAN  <- setup_diff(subtract_from = "least",
                         subtracted = c(3,2))

## perform generic ML inference
# small number of splits to keep computation time low
x <- GenericML(Z, D, Y, learners, num_splits = 2,
               quantile_cutoffs = quantile_cutoffs,
               diff_GATES = diff_GATES,
               diff_CLAN = diff_CLAN,
               parallel = FALSE)

## access BLP generic targets for best learner and make plot
get_BLP(x, plot = TRUE)

## access GATES generic targets for best learner and make plot
get_GATES(x, plot = TRUE)

## access CLAN generic targets for "V1" & best learner and make plot
get_CLAN(x, variable = "V1", plot = TRUE)

}

[Package GenericML version 0.2.2]