GenericML {GenericML} | R Documentation
Generic Machine Learning Inference
Description
Performs generic machine learning inference on heterogeneous treatment effects as in Chernozhukov, Demirer, Duflo and Fernández-Val (2020) with user-specified machine learning methods. Intended for randomized experiments.
Usage
GenericML(
Z,
D,
Y,
learners_GenericML,
learner_propensity_score = "constant",
num_splits = 100,
Z_CLAN = NULL,
HT = FALSE,
quantile_cutoffs = c(0.25, 0.5, 0.75),
X1_BLP = setup_X1(),
X1_GATES = setup_X1(),
diff_GATES = setup_diff(),
diff_CLAN = setup_diff(),
vcov_BLP = setup_vcov(),
vcov_GATES = setup_vcov(),
equal_variances_CLAN = FALSE,
prop_aux = 0.5,
stratify = setup_stratify(),
significance_level = 0.05,
min_variation = 1e-05,
parallel = FALSE,
num_cores = parallel::detectCores(),
seed = NULL,
store_learners = FALSE,
store_splits = TRUE
)
Arguments
Z
A numeric design matrix that holds the covariates in its columns.
D
A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero assignment to the control group.
Y
A numeric vector containing the response variable.
learners_GenericML
A character vector specifying the machine learners to be used for estimating the baseline conditional average (BCA) and conditional average treatment effect (CATE). Either
learner_propensity_score
The estimator of the propensity scores. Either a numeric vector (which is then taken as estimates of the propensity scores) or a string specifying the estimator. In the latter case, the string must either be equal to
num_splits
Number of sample splits. Default is 100. Must be larger than one. If you want to run
Z_CLAN
A numeric matrix holding variables on which classification analysis (CLAN) shall be performed. CLAN will be performed on each column of the matrix. If
HT
Logical. If
quantile_cutoffs
The cutoff points of the quantiles that shall be used for GATES grouping. Default is
X1_BLP
Specifies the design matrix
X1_GATES
Same as
diff_GATES
Specifies the generic targets of GATES. Must be an object of class
diff_CLAN
Same as
vcov_BLP
Specifies the covariance matrix estimator in the BLP regression. Must be an object of class
vcov_GATES
Same as
equal_variances_CLAN
Logical. If
prop_aux
Proportion of samples that shall be in the auxiliary set in case of random sample splitting. Default is 0.5. The number of samples in the auxiliary set will be equal to
stratify
A list that specifies whether or not stratified sample splitting shall be performed. It is recommended to use the returned object of
significance_level
Significance level for VEIN. Default is 0.05.
min_variation
Specifies a threshold for the minimum variation of the BCA/CATE predictions. If the variation of a BCA/CATE prediction falls below this threshold, random noise with distribution
parallel
Logical. If
num_cores
Number of cores to be used in parallelization (if applicable). Default is the number of cores of the user's machine.
seed
Random seed. Default is
store_learners
Logical. If
store_splits
Logical. If
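As a rough illustration of how quantile_cutoffs relates to the GATES groups: the three default cutoffs c(0.25, 0.5, 0.75) imply four groups (G1 to G4, as used in the Examples). The base-R sketch below bins a hypothetical vector of proxy predictions, cate_proxy, at those quantiles. The variable names and the use of cut() are assumptions made for illustration; this is not the package's internal code.

```r
set.seed(1)
cate_proxy <- rnorm(200)                 # hypothetical CATE proxy predictions
quantile_cutoffs <- c(0.25, 0.5, 0.75)   # default value from the Usage section

# K - 1 interior cutoffs define K groups; pad with -Inf/Inf for the outer edges
breaks <- c(-Inf, quantile(cate_proxy, probs = quantile_cutoffs), Inf)
groups <- cut(cate_proxy, breaks = breaks,
              labels = paste0("G", seq_len(length(quantile_cutoffs) + 1)))

# group sizes; G1 holds the smallest proxy values, G4 the largest
table(groups)
```

Supplying more cutoffs produces more, and therefore smaller, groups.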
Details
The specifications "lasso", "random_forest", and "tree" in learners_GenericML and learner_propensity_score correspond to the following mlr3 specifications (we omit the keywords classif. and regr.). "lasso" is a cross-validated Lasso estimator, which corresponds to 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)'. "random_forest" is a random forest with 500 trees, which corresponds to 'mlr3::lrn("ranger", num.trees = 500)'. "tree" is a tree learner, which corresponds to 'mlr3::lrn("rpart")'.
Warning: GenericML() can be quite memory-intensive, in particular when the data set is large. To alleviate memory usage, consider setting store_learners = FALSE, choosing a low number of cores via num_cores (at the expense of longer computing time), setting prop_aux to a value smaller than the default of 0.5, or using GenericML_combine().
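A minimal sketch of a memory-conscious call, using only arguments listed under Usage. The simulated data follow the recipe from the Examples section. The specific values (prop_aux = 0.3, two splits, the "tree" learner) are illustrative assumptions, not recommendations, and the call runs only if the GenericML and rpart packages are installed.

```r
## simulated data, following the Examples section
set.seed(1)
n <- 150
p <- 5
D <- rbinom(n, 1, 0.5)
Z <- matrix(runif(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
Y <- as.numeric(Z %*% rexp(p) + rnorm(n)) + 2 * D   # constant effect of 2

if (requireNamespace("GenericML", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  x <- GenericML::GenericML(
    Z, D, Y,
    learners_GenericML = "tree",
    num_splits     = 2,       # kept tiny so the sketch runs quickly
    store_learners = FALSE,   # do not keep intermediate per-split results
    prop_aux       = 0.3,     # below the default of 0.5 (illustrative value)
    parallel       = FALSE    # sequential run, so num_cores is not used
  )
}
```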
Value
An object of class "GenericML". On this object, we recommend using the accessor functions get_BLP(), get_GATES(), and get_CLAN() to extract the results of the analyses of BLP, GATES, and CLAN, respectively. An object of class "GenericML" contains the following components:
VEIN
A list containing two sub-lists called best_learners and all_learners, respectively. Each of these two sub-lists contains the inferential VEIN results on the generic targets of the BLP, GATES, and CLAN analyses. all_learners does this for all learners specified in the argument learners_GenericML, best_learners only for the corresponding best learners. Which learner is best for which analysis is assessed by the Λ criteria discussed in Sections 5.2 and 5.3 of the paper.
best
A list containing information on the evaluation of which learner is the best for which analysis. Contains four components. The first three contain the name of the best learner for BLP, GATES, and CLAN, respectively. The fourth component, overview, contains the two Λ criteria used to determine the best learners (discussed in Sections 5.2 and 5.3 of the paper).
propensity_scores
The propensity score estimates as well as the "mlr3" objects used to estimate them (if mlr3 was used for estimation).
GenericML_single
Only nonempty if store_learners = TRUE. Contains all intermediate results of each learner for each split. That is, for a given learner (first level of the list) and split (second level), objects of classes "BLP", "GATES", "CLAN", "proxy_BCA", "proxy_CATE" as well as the Λ criteria ("best") are listed, which were computed with the given learner and split.
splits
Only nonempty if store_splits = TRUE. Contains a character matrix of dimension length(Y) by num_splits. Contains the group membership (main or auxiliary) of each observation (rows) in each split (columns). "M" denotes the main set, "A" the auxiliary set.
generic_targets
A list of generic target estimates for each learner. More specifically, each component is a list of the generic target estimates pertaining to the BLP, GATES, and CLAN analyses. Each of those lists contains a three-dimensional array containing the generic targets of a single learner for all sample splits (except CLAN where there is one more layer of lists).
arguments
A list of arguments used in the function call.
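To make the layout of the splits component concrete, the base-R sketch below builds a mock matrix with the documented shape (observations in rows, splits in columns, entries "M" or "A") and inspects it. The matrix is simulated here and the sizes are hypothetical. With a fitted object the same inspection would presumably start from x$splits, assuming the components are accessed by name with $.

```r
set.seed(1)
n          <- 150   # hypothetical number of observations, i.e. length(Y)
num_splits <- 3     # hypothetical number of sample splits
n_aux      <- 75    # hypothetical auxiliary-set size, chosen for illustration

# mock of the `splits` component: one column of "M"/"A" labels per split
splits <- replicate(num_splits, {
  s <- rep("M", n)
  s[sample.int(n, n_aux)] <- "A"
  s
})

dim(splits)                 # n rows, num_splits columns
colSums(splits == "A")      # auxiliary-set size in each split

# row indices of the main sample in the first split
main_idx <- which(splits[, 1] == "M")
```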
Note
In an earlier development version, Lucas Kitzmueller alerted us to several minor bugs and proposed fixes. Many thanks to him!
References
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. (2019). “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, 4(44), 1903. doi: 10.21105/joss.01903.
See Also
plot.GenericML(), print.GenericML(), get_BLP(), get_GATES(), get_CLAN(), setup_X1(), setup_diff(), setup_vcov(), setup_stratify(), GenericML_single(), GenericML_combine()
Examples
if (require("glmnet") && require("ranger")) {

  ## generate data
  set.seed(1)
  n  <- 150                                   # number of observations
  p  <- 5                                     # number of covariates
  D  <- rbinom(n, 1, 0.5)                     # random treatment assignment
  Z  <- matrix(runif(n*p), n, p)              # design matrix
  Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n))  # potential outcome without treatment
  Y1 <- 2 + Y0                                # potential outcome under treatment
  Y  <- ifelse(D == 1, Y1, Y0)                # observed outcome

  ## column names of Z
  colnames(Z) <- paste0("V", 1:p)

  ## specify learners
  learners <- c("lasso", "mlr3::lrn('ranger', num.trees = 10)")

  ## glmnet v4.1.3 isn't supported on Solaris, so skip Lasso in this case
  if (Sys.info()["sysname"] == "SunOS") learners <- learners[-1]

  ## specify quantile cutoffs (the 4 quartile groups here)
  quantile_cutoffs <- c(0.25, 0.5, 0.75)

  ## specify the differenced generic targets of GATES and CLAN
  # use G4-G1, G4-G2, G4-G3 as differenced generic targets in GATES
  diff_GATES <- setup_diff(subtract_from = "most",
                           subtracted = c(1,2,3))
  # use G1-G3, G1-G2 as differenced generic targets in CLAN
  diff_CLAN  <- setup_diff(subtract_from = "least",
                           subtracted = c(3,2))

  ## perform generic ML inference
  # small number of splits to keep computation time low
  x <- GenericML(Z, D, Y, learners, num_splits = 2,
                 quantile_cutoffs = quantile_cutoffs,
                 diff_GATES = diff_GATES,
                 diff_CLAN = diff_CLAN,
                 parallel = FALSE)

  ## access BLP generic targets for best learner and make plot
  get_BLP(x, plot = TRUE)

  ## access GATES generic targets for best learner and make plot
  get_GATES(x, plot = TRUE)

  ## access CLAN generic targets for "V1" & best learner and make plot
  get_CLAN(x, variable = "V1", plot = TRUE)
}