tsci_forest {TSCI} | R Documentation |
Two Stage Curvature Identification with Random Forests
Description
tsci_forest
implements Two Stage Curvature Identification
(Guo and Buehlmann 2022) with random forests. Through a data-dependent way, it
tests for the smallest sufficiently large violation space among a pre-specified
sequence of nested violation space candidates. Point and uncertainty estimates
of the treatment effect for all violation space candidates including the
selected violation space will be returned amongst other relevant statistics.
Usage
tsci_forest(
Y,
D,
Z,
X = NULL,
W = X,
vio_space,
create_nested_sequence = TRUE,
sel_method = c("comparison", "conservative"),
split_prop = 2/3,
num_trees = 200,
mtry = NULL,
max_depth = 0,
min_node_size = c(5, 10, 20),
self_predict = FALSE,
sd_boot = TRUE,
iv_threshold = 10,
threshold_boot = TRUE,
alpha = 0.05,
nsplits = 10,
mult_split_method = c("FWER", "DML"),
intercept = TRUE,
parallel = c("no", "multicore", "snow"),
ncores = 1,
cl = NULL,
raw_output = NULL,
B = 300
)
Arguments
Y |
observations of the outcome variable. Either a numeric vector of length n or a numeric matrix with dimension n by 1. If outcome variable is binary use dummy encoding. |
D |
observations of the treatment variable. Either a numeric vector of length n or a numeric matrix with dimension n by 1. If treatment variable is binary use dummy encoding. |
Z |
observations of the instrumental variable(s). Either a vector of length n or a matrix with dimension n by s. If observations are not numeric dummy encoding will be applied. |
X |
observations of baseline covariate(s). Either a vector of length n
or a matrix with dimension n by p or |
W |
(transformed) observations of baseline covariate(s) used to fit the outcome model. Either a vector of length n
or a matrix with dimension n by p_w or |
vio_space |
list with vectors of length n and/or matrices with n rows as elements to specify the violation space candidates. If observations are not numeric dummy encoding will be applied. See Details for more information. |
create_nested_sequence |
logical. If |
sel_method |
the selection method used to estimate the treatment effect. Either "comparison" or "conservative". See Details. |
split_prop |
proportion of observations used to fit the outcome model. Has to be a numeric value in (0, 1). |
num_trees |
number of trees in random forests. Can either be a single integer value or a vector of integer values to try. |
mtry |
number of covariates to possibly split at in each node of the tree of the random forest. Can either be a single integer value or a vector of integer values to try. Can also be a list of single argument function(s) returning an integer value, given the number of independent variables. The values have to be positive integers not larger than the number of independent variables in the treatment model. Default is to try all integer values between one-third of the independent variables and two-thirds of the independent variables. |
max_depth |
maximal tree depth in random forests. Can either be a single integer value or a vector of integer values to try. 0 correspond to unlimited depth. |
min_node_size |
minimal size of each leaf node in the random forest. Can either be a single integer value or a vector of integer values to try. |
self_predict |
logical. If |
sd_boot |
logical. if |
iv_threshold |
a numeric value specifying the minimum of the threshold of IV strength test. |
threshold_boot |
logical. If |
alpha |
the significance level. Has to be a numeric value between 0 and 1. |
nsplits |
number of times the data will be split. Has to be an integer larger or equal 1. See Details. |
mult_split_method |
method to calculate the standard errors, p-values and to construct the confidence intervals if multi-splitting is performed.
Default is "DML" if |
intercept |
logical. If |
parallel |
one out of |
ncores |
the number of cores to use. Has to be an integer value larger or equal 1. |
cl |
either a parallel or snow cluster or |
raw_output |
logical. If |
B |
number of bootstrap samples. Has to be a positive integer value.
Bootstrap methods are used to calculate the IV strength threshold if |
Details
The treatment and outcome models are assumed to be of the following forms:
D_i = f(Z_i, X_i) + \delta_i
Y_i = \beta \cdot D_i + h(Z_i, X_i) + \phi(X_i) + \epsilon_i
where f(Z_i, X_i)
is estimated using a random forest,
h(Z_i X_i)
is approximated using the violation space candidates and \phi(X_i)
is approximated by
a linear combination of the columns in W
. The errors are allowed to be heteroscedastic.
To avoid overfitting bias the data is randomly split into two subsets A1
and A2
where the proportion of observations in the two sets is specified by split_prop
.
A2
is used to train the random forest and A1
is used to perform violation space selection
and to estimate the treatment effect.
The package ranger
is used to fit the random forest. If any of num_trees
,
max_depth
or min_node_size
has more than one value,
the best parameter combination is chosen by minimizing the out-of-bag mean squared error.
The violation space candidates should be in a nested sequence as the violation space selection is performed
by comparing the treatment estimate obtained by each violation space candidate with the estimates of all
violation space candidates further down the list vio_space
that provide enough IV strength. Only if no
significant difference was found in all of those comparisons, the violation space
candidate will be selected. If sel_method
is 'comparison', the treatment effect estimate of this
violation space candidate will be returned. If sel_method
is 'conservative', the treatment effect estimate
of the successive violation space candidate will be returned provided that the IV strength is large enough.
The specification of suitable violation space candidates is a crucial step because a poor approximation
of g(Z_i, X_i)
might not address the bias caused by the violation of the IV assumption sufficiently well.
The function create_monomials
can be used to create a predefined sequence of
violation space candidates consisting of monomials.
The function create_interactions
can be used to create a predefined sequence of
violation space candidates consisting of two-way interactions between the instrumens themselves and between
the instruments and the instruments and baseline covariates.
The instrumental variable(s) are considered strong enough for violation space candidate V_q
if the estimated IV strength using this
violation space candidate is larger than the obtained value of the threshold of the IV strength.
The formula of the threshold of the IV strength has the form
\min \{\max \{ 2 \cdot \mathrm{Trace} [ \mathrm{M} (V_q) ], \mathrm{iv{\_}threshold} \} + S (V_q), 40 \}
if threshold_boot
is TRUE
, and
\min \{\max \{ 2 \cdot \mathrm{Trace} [ \mathrm{M} (V_q) ], \mathrm{iv{\_}threshold} \}, 40 \}
if threshold_boot
is FALSE
. The matrix
\mathrm{M} (V_q)
depends on the hat matrix obtained from estimating f(Z_i, X_i)
, the violation space candidate V_q
and
the variables to include in the outcome model W
. S (V_q)
is obtained using a bootstrap and aims to adjust for the estimation error
of the IV strength.
Usually, the value of the threshold of the IV strength obtained using the bootstrap approach is larger.
Thus, using threshold_boot
equals TRUE
leads to a more conservative IV strength test.
For more information see subsection 3.3 in Guo and Buehlmann (2022).
nsplits
specifies the number of data splits that should be performed.
For each data split the output statistics such as the point estimates of the
treatment effect are calculated. Those statistics will then be aggregated
over the different data splits. It is recommended to perform multiple data splits
as data splitting introduces additional randomness. By aggregating the results
of multiple data splits, the effects of this randomness can be decreased.
If nsplits
is larger than 1, point estimates are aggregated by medians.
Standard errors, p-values and confidence intervals are obtained by the method
specified by the parameter mult_split_method
. 'DML' uses the approach by
Chernozhukov et al. (2018). 'FWER' uses the approach by Meinshausen et al. (2009)
and controls for the family-wise error rate. 'FWER' does not provide standard errors.
For large sample sizes, a large values for nsplits
can lead to a high
running time as for each split a new hat matrix must be calculated.
There are three possibilities to set the argument parallel
, namely
"no"
for serial evaluation (default),
"multicore"
for parallel evaluation using forking,
and "snow"
for parallel evaluation using a parallel
socket cluster. It is recommended to select RNGkind
("L'Ecuyer-CMRG"
) and to set a seed to ensure that the parallel
computing of the package TSCI
is reproducible.
This ensures that each processor receives a different substream of the
pseudo random number generator stream.
Thus, the results are reproducible if the arguments (including ncores
)
remain unchanged.
There is an optional argument cl
to specify a custom cluster
if parallel = "snow"
.
Results obtained on different operating systems might differ even when the same
seed is set. The reason for this lies in the way the random forest algorithm in
ranger
is implemented. Currently, we are not aware of a solution to
ensure reproducibility across operating systems when using tsci_forest
.
However, tsci_boosting
, tsci_poly
and
tsci_secondstage
do not have this issue.
See also Carl et al. (2023) for more details.
Value
A list containing the following elements:
Coef_all
a series of point estimates of the treatment effect obtained by the different violation space candidates.
sd_all
standard errors of the estimates of the treatmnet effect obtained by the different violation space candidates.
pval_all
p-values of the treatment effect estimates obtained by the different violation space candidates.
CI_all
confidence intervals for the treatment effect obtained by the different violation space candidates.
Coef_sel
the point estimator of the treatment effect obtained by the selected violation space candidate(s).
sd_sel
the standard error of Coef_sel.
pval_sel
p-value of the treatment effect estimate obtained by the selected violation space candidate(s).
CI_sel
confidence interval for the treatment effect obtained by the selected violation space candidate(s).
iv_str
IV strength using the different violation space candidates.
iv_thol
the threshold for the IV strength using the different violation space candidates.
Qmax
the frequency each violation space candidate was the largest violation space candidate for which the IV strength was considered large enough determined by the IV strength test over the multiple data splits. If 0, the IV Strength test failed for the first violation space candidate. Otherwise, violation space selection was performed.
q_comp
the frequency each violation space candidate was selected by the comparison method over the multiple data splits.
q_cons
the frequency each violation space candidate was selected by the conservative method over the multiple data splits.
invalidity
the frequency the instrumental variable(s) were considered valid, invalid or too weak to test for violations. The instrumental variables are considered too weak to test for violations if the IV strength is already too weak using the first violation space candidate (besides the empty violation space). Testing for violations is always performed by using the comparison method.
mse
the out-of-sample mean squared error of the fitted treatment model.
FirstStage_model
the method used to fit the treatment model.
n_A1
number of observations in A1.
n_A2
number of observations in A2.
nsplits
number of data splits performed.
mult_split_method
the method used to calculate the standard errors and p-values.
alpha
the significance level used.
References
Zijian Guo, and Peter Buehlmann. Two Stage Curvature Identification with Machine Learning: Causal Inference with Possibly Invalid Instrumental Variables. arXiv:2203.12808, 2022
Nicolai Meinshausen, Lukas Meier, and Peter Buehlmann. P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671-1681, 2009. 16, 18
Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters: Double/debiased machine learning. The Econometrics Journal, 21(1), 2018. 4, 16, 18
David Carl, Corinne Emmenegger, Peter Buehlmann, and Zijian Guo. TSCI: two stage curvature identification for causal inference with invalid instruments. arXiv:2304.00513, 2023
See Also
tsci_boosting
for TSCI with boosting.
tsci_poly
for TSCI with polynomial basis expansion.
tsci_secondstage
for TSCI with user provided hat matrix.
Examples
### a small example without baseline covariates
if (require("MASS")) {
# sample size
n <- 100
# the IV strength
a <- 1
# the violation strength
tau <- 1
# true effect
beta <- 1
# treatment model
f <- function(x) {1 + a * (x + x^2)}
# outcome model
g <- function(x) {1 + tau * x}
# generate data
mu_error <- rep(0, 2)
Cov_error <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
Error <- MASS::mvrnorm(n, mu_error, Cov_error)
# instrumental variable
Z <- rnorm(n)
# treatment variable
D <- f(Z) + Error[, 1]
# outcome variable
Y <- beta * D + g(Z) + Error[, 2]
# Two Stage Random Forest
# create violation space candidates
vio_space <- create_monomials(Z, 2, "monomials_main")
# perform two stage curvature identification
output_RF <- tsci_forest(Y, D, Z, vio_space = vio_space, nsplits = 1,
num_trees = 50, B = 100)
summary(output_RF)
}