GenericML {GenericML} R Documentation

Generic Machine Learning Inference

Description

Performs generic machine learning inference on heterogeneous treatment effects as in Chernozhukov, Demirer, Duflo and Fernández-Val (2020) with user-specified machine learning methods. Intended for randomized experiments.

Usage

GenericML(
  Z,
  D,
  Y,
  learners_GenericML,
  learner_propensity_score = "constant",
  num_splits = 100,
  Z_CLAN = NULL,
  HT = FALSE,
  quantile_cutoffs = c(0.25, 0.5, 0.75),
  X1_BLP = setup_X1(),
  X1_GATES = setup_X1(),
  diff_GATES = setup_diff(),
  diff_CLAN = setup_diff(),
  vcov_BLP = setup_vcov(),
  vcov_GATES = setup_vcov(),
  equal_variances_CLAN = FALSE,
  prop_aux = 0.5,
  stratify = setup_stratify(),
  significance_level = 0.05,
  min_variation = 1e-05,
  parallel = FALSE,
  num_cores = parallel::detectCores(),
  seed = NULL,
  store_learners = FALSE,
  store_splits = TRUE
)

Arguments

Z

A numeric design matrix that holds the covariates in its columns.

D

A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero assignment to the control group.

Y

A numeric vector containing the response variable.

learners_GenericML

A character vector specifying the machine learners to be used for estimating the baseline conditional average (BCA) and conditional average treatment effect (CATE). Each element must be 'lasso', 'random_forest', 'tree', or a custom learner specified in mlr3 syntax. For a custom learner, do not indicate in the mlr3 specification whether the learner is a regression learner or a classification learner. Example: 'mlr3::lrn("ranger", num.trees = 100)' for a random forest learner with 100 trees. Note that the specification is passed as a string and that it omits the classif. and regr. keywords. See https://mlr3learners.mlr-org.com for a list of mlr3 learners.
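As a rough illustration of the string format described for this argument, the following sketch builds a learner vector that mixes built-in names with one mlr3 specification. The object name learners is chosen for illustration only.

```r
# Built-in shorthand names plus one custom mlr3 learner, all given as strings.
# Note: the mlr3 call carries no "regr." or "classif." prefix.
learners <- c("lasso",
              "tree",
              "mlr3::lrn('ranger', num.trees = 100)")

# The vector is plain character data; nothing is evaluated at this point.
is.character(learners)
```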

learner_propensity_score

The estimator of the propensity scores. Either a numeric vector (which is then taken as estimates of the propensity scores) or a string specifying the estimator. In the latter case, the string must be 'constant' (estimates the propensity scores by mean(D)), 'lasso', 'random_forest', 'tree', or a learner in mlr3 syntax. With mlr3 syntax, do not indicate whether the learner is a regression learner or a classification learner. Example: 'mlr3::lrn("ranger", num.trees = 100)' for a random forest learner with 100 trees. Note that the specification is passed as a string and that it omits the classif. and regr. keywords. See https://mlr3learners.mlr-org.com for a list of mlr3 learners.
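The two kinds of input accepted by this argument can be sketched as follows, using simulated data. The object names ps_constant and ps_known are illustrative, and the assignment probability of 0.5 is an assumption of this toy example.

```r
set.seed(1)
D <- rbinom(200, 1, 0.5)   # simulated random treatment assignment

# What the 'constant' option estimates: one common score, mean(D)
ps_constant <- rep(mean(D), length(D))

# Alternative: a numeric vector supplied directly, e.g. the assignment
# probability known from the experimental design (assumed to be 0.5 here)
ps_known <- rep(0.5, length(D))
```

Either vector has one entry per observation, which is the shape a numeric learner_propensity_score is expected to have.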

num_splits

Number of sample splits. Default is 100. Must be larger than one. If you want to run GenericML on a single split, please use GenericML_single().

Z_CLAN

A numeric matrix holding variables on which classification analysis (CLAN) shall be performed. CLAN will be performed on each column of the matrix. If NULL (default), then Z_CLAN = Z, i.e. CLAN is performed for all variables in Z.

HT

Logical. If TRUE, a Horvitz-Thompson (HT) transformation is applied in the BLP and GATES regressions. Default is FALSE.

quantile_cutoffs

The cutoff points of the quantiles that shall be used for GATES grouping. Default is c(0.25, 0.5, 0.75), which splits the sample into four quartile groups.
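A minimal sketch of how three cutoffs produce four groups, using base R only. The vector cate_proxy is a simulated stand-in for CATE proxy predictions, and the package's own handling of interval boundaries may differ from cut()'s defaults.

```r
set.seed(1)
cate_proxy <- rnorm(1000)              # hypothetical CATE proxy predictions
quantile_cutoffs <- c(0.25, 0.5, 0.75)

# Empirical quantiles at the cutoffs serve as interior breakpoints
breaks <- c(-Inf, quantile(cate_proxy, probs = quantile_cutoffs), Inf)

# K cutoffs give K + 1 groups; G1 holds the smallest predictions
groups <- cut(cate_proxy, breaks = breaks,
              labels = paste0("G", seq_len(length(quantile_cutoffs) + 1)))
table(groups)
```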

X1_BLP

Specifies the design matrix X_1 in the regression. Must be an object of class "setup_X1". See the documentation of setup_X1() for details.

X1_GATES

Same as X1_BLP, just for the GATES regression.

diff_GATES

Specifies the generic targets of GATES. Must be an object of class "setup_diff". See the documentation of setup_diff() for details.

diff_CLAN

Same as diff_GATES, just for the CLAN generic targets.

vcov_BLP

Specifies the covariance matrix estimator in the BLP regression. Must be an object of class "setup_vcov". See the documentation of setup_vcov() for details.

vcov_GATES

Same as vcov_BLP, just for the GATES regression.

equal_variances_CLAN

Logical. If TRUE, then all within-group variances of the CLAN groups are assumed to be equal. Default is FALSE. This specification is required for heteroskedasticity-robust variance estimation on the difference of two CLAN generic targets (i.e. variance of the difference of two means). If TRUE (corresponds to homoskedasticity assumption), the pooled variance is used. If FALSE (heteroskedasticity), the variance of Welch's t-test is used.
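The two variance estimators for a difference of two group means can be compared in a short sketch. The data and the names g_most and g_least are simulated and illustrative; the cross-check relies on the stderr component returned by stats::t.test() (available since R 3.6.0).

```r
set.seed(1)
g_most  <- rnorm(40, mean = 1, sd = 1)  # hypothetical values, most affected group
g_least <- rnorm(60, mean = 0, sd = 3)  # hypothetical values, least affected group
n1 <- length(g_most)
n2 <- length(g_least)

# Welch (unequal variances): each group keeps its own variance
var_welch <- var(g_most) / n1 + var(g_least) / n2

# Pooled (equal variances): one common variance, weighted by degrees of freedom
s2_pooled  <- ((n1 - 1) * var(g_most) + (n2 - 1) * var(g_least)) / (n1 + n2 - 2)
var_pooled <- s2_pooled * (1 / n1 + 1 / n2)

# Cross-check against the standard errors used by t.test()
se_welch  <- unname(t.test(g_most, g_least, var.equal = FALSE)$stderr)
se_pooled <- unname(t.test(g_most, g_least, var.equal = TRUE)$stderr)
c(welch = sqrt(var_welch), pooled = sqrt(var_pooled))
```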

prop_aux

Proportion of samples that shall be in the auxiliary set in case of random sample splitting. Default is 0.5. The number of samples in the auxiliary set will be equal to floor(prop_aux * length(Y)). If the data set is large, you can save computing time by choosing prop_aux to be smaller than 0.5. In case of stratified sampling (controlled through the argument stratify via setup_stratify()), prop_aux does not have an effect, and the number of samples in the auxiliary set is specified via setup_stratify().
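The size rule floor(prop_aux * length(Y)) can be illustrated with a single random split in base R. This is a sketch of the arithmetic only, not the package's internal splitting code; the names n_aux and membership are illustrative.

```r
set.seed(1)
Y <- rnorm(151)          # simulated response with an odd sample size
prop_aux <- 0.5

# Size of the auxiliary set; the remainder forms the main set
n_aux <- floor(prop_aux * length(Y))

# Draw the auxiliary indices at random and label every observation
aux_idx <- sample.int(length(Y), size = n_aux)
membership <- rep("M", length(Y))
membership[aux_idx] <- "A"
table(membership)
```

With 151 observations and prop_aux = 0.5, the auxiliary set receives 75 observations and the main set the remaining 76.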

stratify

A list that specifies whether or not stratified sample splitting shall be performed. It is recommended to use the returned object of setup_stratify() as this list. See the documentation of setup_stratify() for details.

significance_level

Significance level for VEIN. Default is 0.05.

min_variation

Specifies a threshold for the minimum variation of the BCA/CATE predictions. If the variation of a BCA/CATE prediction falls below this threshold, random noise with distribution N(0, var(Y)/20) is added to it. Default is 1e-05.
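A minimal sketch of the safeguard described for this argument. It makes two assumptions that the text does not settle: that "variation" is measured by the sample variance, and that var(Y)/20 is the variance (not the standard deviation) of the added noise.

```r
set.seed(1)
Y <- rnorm(100)               # simulated response
proxy <- rep(0.3, 100)        # degenerate prediction with zero variation
min_variation <- 1e-05

if (var(proxy) < min_variation) {
  # assumption: var(Y) / 20 is a variance, so sd = sqrt(var(Y) / 20)
  proxy <- proxy + rnorm(length(proxy), mean = 0, sd = sqrt(var(Y) / 20))
}
var(proxy) > min_variation
```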

parallel

Logical. If TRUE, parallel computing will be used. Default is FALSE. On Unix systems, this will be done via forking (memory is shared across the forked processes). On non-Unix systems, this will be done through parallel socket clusters.

num_cores

Number of cores to be used in parallelization (if applicable). Default is the number of cores of the user's machine.

seed

Random seed. Default is NULL for no random seeding.

store_learners

Logical. If TRUE, all intermediate results of the learners will be stored. That is, for each learner and each split, all BCA and CATE predictions as well as all BLP, GATES, CLAN, and \Lambda estimates will be stored. Default is FALSE.

store_splits

Logical. If TRUE (default), the sample splits will be stored.

Details

The specifications "lasso", "random_forest", and "tree" in learners_GenericML and learner_propensity_score correspond to the following mlr3 specifications (we omit the keywords classif. and regr.). "lasso" is a cross-validated Lasso estimator, which corresponds to 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)'. "random_forest" is a random forest with 500 trees, which corresponds to 'mlr3::lrn("ranger", num.trees = 500)'. "tree" is a tree learner, which corresponds to 'mlr3::lrn("rpart")'.

Warning: GenericML() can be quite memory-intensive, in particular when the data set is large. To alleviate memory usage, consider setting store_learners = FALSE, choosing a low number of cores via num_cores (at the expense of longer computing time), setting prop_aux to a value smaller than the default of 0.5, or using GenericML_combine().
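The correspondence between the three shorthand names and their mlr3 specifications can be written as a small lookup table. The names builtin_learners and expand_learner are hypothetical and not part of the package; the sketch only manipulates strings and evaluates nothing.

```r
# Shorthand name -> mlr3 specification string, as listed in this section
# (single quotes inside the strings are equivalent to the double quotes above)
builtin_learners <- c(
  lasso         = "mlr3::lrn('cv_glmnet', s = 'lambda.min', alpha = 1)",
  random_forest = "mlr3::lrn('ranger', num.trees = 500)",
  tree          = "mlr3::lrn('rpart')"
)

# Hypothetical helper: expand a shorthand name, pass custom strings through
expand_learner <- function(x) {
  if (x %in% names(builtin_learners)) unname(builtin_learners[x]) else x
}

expand_learner("tree")
expand_learner("mlr3::lrn('ranger', num.trees = 10)")
```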

Value

An object of class "GenericML". On this object, we recommend using the accessor functions get_BLP(), get_GATES(), and get_CLAN() to extract the results of the analyses of BLP, GATES, and CLAN, respectively. An object of class "GenericML" contains the following components:

VEIN

A list containing two sub-lists called best_learners and all_learners, respectively. Each of these two sub-lists contains the inferential VEIN results on the generic targets of the BLP, GATES, and CLAN analyses. all_learners does this for all learners specified in the argument learners_GenericML, best_learners only for the corresponding best learners. Which learner is best for which analysis is assessed by the \Lambda criteria discussed in Sections 5.2 and 5.3 of the paper.

best

A list containing information on the evaluation of which learner is the best for which analysis. Contains four components. The first three contain the name of the best learner for BLP, GATES, and CLAN, respectively. The fourth component, overview, contains the two \Lambda criteria used to determine the best learners (discussed in Sections 5.2 and 5.3 of the paper).

propensity_scores

The propensity score estimates as well as the "mlr3" objects used to estimate them (if mlr3 was used for estimation).

GenericML_single

Only nonempty if store_learners = TRUE. Contains all intermediate results of each learner for each split. That is, for a given learner (first level of the list) and split (second level), objects of classes "BLP", "GATES", "CLAN", "proxy_BCA", "proxy_CATE" as well as the \Lambda criteria ("best") are listed, which were computed with the given learner and split.

splits

Only nonempty if store_splits = TRUE. Contains a character matrix of dimension length(Y) by num_splits. Contains the group membership (main or auxiliary) of each observation (rows) in each split (columns). "M" denotes the main set, "A" the auxiliary set.
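A matrix of this shape can be queried with ordinary indexing. The sketch below works on a simulated stand-in (the object splits is generated here, not taken from a fitted "GenericML" object) to show how main-set rows and per-split counts could be recovered.

```r
set.seed(1)
n <- 10
num_splits <- 3

# Simulated stand-in: rows are observations, columns are splits,
# entries are "M" (main set) or "A" (auxiliary set)
splits <- replicate(num_splits, sample(rep(c("A", "M"), each = n / 2)))
dim(splits)

# Row indices of the main set in the first split
main_idx <- which(splits[, 1] == "M")

# Number of auxiliary observations per split
aux_per_split <- colSums(splits == "A")

# Share of splits in which each observation belongs to the main set
share_main <- rowMeans(splits == "M")
```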

generic_targets

A list of generic target estimates for each learner. More specifically, each component is a list of the generic target estimates pertaining to the BLP, GATES, and CLAN analyses. Each of those lists contains a three-dimensional array containing the generic targets of a single learner for all sample splits (except CLAN where there is one more layer of lists).

arguments

A list of arguments used in the function call.

Note

In an earlier development version, Lucas Kitzmueller alerted us to several minor bugs and proposed fixes. Many thanks to him!

References

Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.

Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. (2019). “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, 4(44), 1903. doi: 10.21105/joss.01903.

See Also

plot.GenericML(), print.GenericML(), get_BLP(), get_GATES(), get_CLAN(), setup_X1(), setup_diff(), setup_vcov(), setup_stratify(), GenericML_single(), GenericML_combine()

Examples

if (require("glmnet") && require("ranger")) {

## generate data
set.seed(1)
n  <- 150                                  # number of observations
p  <- 5                                    # number of covariates
D  <- rbinom(n, 1, 0.5)                    # random treatment assignment
Z  <- matrix(runif(n*p), n, p)             # design matrix
Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
Y1 <- 2 + Y0                               # potential outcome under treatment
Y  <- ifelse(D == 1, Y1, Y0)               # observed outcome

## column names of Z
colnames(Z) <- paste0("V", 1:p)

## specify learners
learners <- c("lasso", "mlr3::lrn('ranger', num.trees = 10)")

## glmnet v4.1.3 isn't supported on Solaris, so skip Lasso in this case
if(Sys.info()["sysname"] == "SunOS") learners <- learners[-1]

## specify quantile cutoffs (the 4 quartile groups here)
quantile_cutoffs <- c(0.25, 0.5, 0.75)

## specify the differenced generic targets of GATES and CLAN
# use G4-G1, G4-G2, G4-G3 as differenced generic targets in GATES
diff_GATES <- setup_diff(subtract_from = "most",
                         subtracted = c(1,2,3))
# use G1-G3, G1-G2 as differenced generic targets in CLAN
diff_CLAN  <- setup_diff(subtract_from = "least",
                         subtracted = c(3,2))

## perform generic ML inference
# small number of splits to keep computation time low
x <- GenericML(Z, D, Y, learners, num_splits = 2,
               quantile_cutoffs = quantile_cutoffs,
               diff_GATES = diff_GATES,
               diff_CLAN = diff_CLAN,
               parallel = FALSE)

## access BLP generic targets for best learner and make plot
get_BLP(x, plot = TRUE)

## access GATES generic targets for best learner and make plot
get_GATES(x, plot = TRUE)

## access CLAN generic targets for "V1" & best learner and make plot
get_CLAN(x, variable = "V1", plot = TRUE)

}


[Package GenericML version 0.2.2 Index]