generate_cre_dataset {CRE}  R Documentation 
Generate CRE synthetic data
Description
Generates synthetic data sets for simulation experiments in causal
inference. Each data set is composed of an outcome vector (y), a treatment
vector (z), a covariates matrix (X), and an unobserved individual treatment
effects vector (ite).
The arguments specify the data set characteristics, including the
number of individuals (n), the number of covariates (p), the correlation
within the covariates (rho), the number of decision rules (n_rules)
decomposing the Conditional Average Treatment Effect (CATE), the
treatment effect magnitude (effect_size), the confounding mechanism
(confounding), and whether the covariates and outcomes are binary or
continuous (binary_covariates, binary_outcome).
Usage
generate_cre_dataset(
  n = 1000,
  rho = 0,
  n_rules = 2,
  p = 10,
  effect_size = 2,
  binary_covariates = TRUE,
  binary_outcome = TRUE,
  confounding = "no"
)
Arguments
n 
An integer that represents the number of observations. Non-integer values will be converted to integers.
rho 
A non-negative double that represents the correlation within the covariates (default: 0, range: [0,1)).
n_rules 
The number of causal rules (default: 2, range: {1,2,3,4}). 
p 
The number of covariates (default: 10). 
effect_size 
The treatment effect size magnitude (default: 2,
range: 
binary_covariates 
Whether to use binary or continuous covariates
(default: TRUE).
binary_outcome 
Whether to use binary or continuous outcomes
(default: TRUE).
confounding 
Only for continuous outcome, add confounding variables:

Details
The covariates matrix is generated with the specified correlation among
covariates, and each covariate is sampled either from a Bernoulli(0.5)
if binary, or a Gaussian(0,1) if continuous.
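The covariate step can be sketched in base R as follows. This is only an illustration and not the package's implementation: the equicorrelated covariance matrix (rho off the diagonal) and the thresholding of Gaussian draws at 0 to obtain Bernoulli(0.5) marginals are assumptions, and the names Sigma, X_cont, and X_bin are hypothetical.

```r
# Sketch only: equicorrelated Gaussian draws, thresholded at 0 for the
# binary case. Not taken from the package source.
set.seed(123)
n <- 500
p <- 10
rho <- 0.3

# Assumed structure: 1 on the diagonal, rho elsewhere
# (positive definite for rho in [0, 1)).
Sigma <- matrix(rho, nrow = p, ncol = p)
diag(Sigma) <- 1

# Correlate independent standard normals through the Cholesky factor.
X_cont <- matrix(rnorm(n * p), nrow = n, ncol = p) %*% chol(Sigma)
colnames(X_cont) <- paste0("x", seq_len(p))

# Thresholding at the median (0) keeps Bernoulli(0.5) marginals.
X_bin <- (X_cont > 0) * 1L
```

Note that thresholding weakens the correlation, so the binary columns are correlated by less than rho.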
The treatment vector is sampled from a
Bernoulli(\frac{1}{1+ \exp(1-x_1+x_2-x_3)}), enforcing the treatment
assignment probabilities to be a function of observed covariates.
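A minimal base-R sketch of this assignment step, assuming the propensity score is 1 / (1 + exp(1 - x_1 + x_2 - x_3)), i.e. a logistic model with linear predictor -1 + x_1 - x_2 + x_3. The sign pattern is an assumption, and the variable names are hypothetical.

```r
# Sketch only: covariate-dependent treatment assignment.
set.seed(123)
n <- 1000

# Three binary covariates, enough to evaluate the assumed propensity score.
X <- matrix(rbinom(n * 3, size = 1, prob = 0.5), nrow = n, ncol = 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))

# Assumed propensity score: 1 / (1 + exp(1 - x1 + x2 - x3)).
prob <- 1 / (1 + exp(1 - X[, "x1"] + X[, "x2"] - X[, "x3"]))

# One Bernoulli draw per individual with its own probability.
z <- rbinom(n, size = 1, prob = prob)
```

Under this assumption, prob equals plogis(-1 + x1 - x2 + x3), so it always lies strictly between 0 and 1.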
The potential outcomes (y(0) and y(1)) are then sampled from a Bernoulli
if binary, or a Gaussian (with standard deviation equal to 1) if continuous.
Their mean is equal to a confounding term (null, linear or nonlinear, and
always null for binary outcomes) plus 1-4 decision rules weighted by the
treatment effect magnitude. The two potential outcomes characterize the CATE
(and thus the unobserved individual treatment effects vector) as the sum of
the additive contributions of each decision rule considered
(plus an intercept).
The final expression of the CATE depends on the treatment effect magnitude
and the number of decision rules considered.
The 4 decision rules are:
Rule 1:
1\{x_1 > 0.5; x_2 \leq 0.5\}(\textbf{x})
Rule 2:
1\{x_5 > 0.5; x_6 \leq 0.5\}(\textbf{x})
Rule 3:
1\{x_4 \leq 0.5\}(\textbf{x})
Rule 4:
1\{x_5 \leq 0.5; x_7 > 0.5; x_8 \leq 0.5\}(\textbf{x})
with corresponding additive average treatment effect (AATE) equal to:
Rule 1: - effect_size,
Rule 2: + effect_size,
Rule 3: - 0.5 \cdot effect_size,
Rule 4: + 2 \cdot effect_size.
For example, setting effect_size = 4 and n_rules = 2:
\text{CATE}(\textbf{x}) = -4 \cdot 1\{x_1 > 0.5; x_2 \leq 0.5\}(\textbf{x}) +
4 \cdot 1\{x_5 > 0.5; x_6 \leq 0.5\}(\textbf{x})
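The rule-based CATE can be sketched in base R as below. The helper cate_sketch and the objects rules and weights are hypothetical names, the signs of the AATE weights (-1, +1, -0.5, +2) are an assumption, and the intercept mentioned above is omitted for brevity.

```r
# Sketch only: CATE as a weighted sum of the first n_rules decision rules.
rules <- list(
  function(X) X[, 1] > 0.5 & X[, 2] <= 0.5,                # Rule 1
  function(X) X[, 5] > 0.5 & X[, 6] <= 0.5,                # Rule 2
  function(X) X[, 4] <= 0.5,                               # Rule 3
  function(X) X[, 5] <= 0.5 & X[, 7] > 0.5 & X[, 8] <= 0.5 # Rule 4
)

# Assumed AATE multipliers of effect_size, one per rule.
weights <- c(-1, 1, -0.5, 2)

cate_sketch <- function(X, effect_size, n_rules) {
  stopifnot(n_rules %in% 1:4, ncol(X) >= 8)
  cate <- numeric(nrow(X))
  for (k in seq_len(n_rules)) {
    cate <- cate + weights[k] * effect_size * rules[[k]](X)
  }
  cate
}

# Four hand-built individuals (columns x1..x8).
X <- rbind(
  c(1, 0, 0, 0, 0, 0, 0, 0),  # satisfies Rule 1 only
  c(0, 0, 0, 0, 1, 0, 0, 0),  # satisfies Rule 2 only
  c(1, 0, 0, 0, 1, 0, 0, 0),  # satisfies Rules 1 and 2
  c(0, 1, 0, 0, 0, 1, 0, 0)   # satisfies neither
)
cate_sketch(X, effect_size = 4, n_rules = 2)
```

With these weights, an individual who satisfies both rules has a CATE of zero because the two contributions cancel.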
The observed outcome vector y is finally computed by combining the potential
outcomes according to the treatment assignment.
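The last step can be illustrated with a short base-R sketch for the continuous case. It assumes a null confounding term and a hypothetical CATE vector tau. The way the two potential outcomes are drawn here (independently, with means 0 and tau) is an assumption and may differ from the package's internals.

```r
# Sketch only: observed outcome from potential outcomes and treatment.
set.seed(123)
n <- 8
z <- rbinom(n, size = 1, prob = 0.5)   # treatment vector
tau <- rep(c(-4, 4), length.out = n)   # hypothetical CATE values

y0 <- rnorm(n, mean = 0, sd = 1)       # potential outcome under control
y1 <- rnorm(n, mean = tau, sd = 1)     # potential outcome under treatment

# Each individual reveals only the potential outcome matching z.
y <- ifelse(z == 1, y1, y0)

# The individual treatment effect is never observed in real data.
ite <- y1 - y0
```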
Value
A list, representing the generated synthetic data set, containing:
y 
an outcome vector, 
z 
a treatment vector, 
X 
a covariates matrix, 
ite 
an individual treatment effects vector.
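As a hedged usage sketch, the components of the returned list can be inspected as below. It assumes the CRE package is installed and is skipped otherwise. Only arguments and component names documented on this page are used, and has_cre is a hypothetical helper variable.

```r
# Sketch only: inspect the returned list when the CRE package is available.
has_cre <- requireNamespace("CRE", quietly = TRUE)

if (has_cre) {
  set.seed(123)
  dataset <- CRE::generate_cre_dataset(n = 500, p = 10)

  # Documented components: y, z, X, ite.
  str(dataset, max.level = 1)
  table(dataset$z)
}
```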
Note
Set the covariates domain (binary_covariates) and outcome domain
(binary_outcome) according to the experiment of interest.
Increase the complexity of heterogeneity discovery by:
decreasing the sample size (n),
adding correlation among covariates (rho),
increasing the number of rules (n_rules),
increasing the number of covariates (p),
decreasing the absolute value of the causal effect (effect_size),
adding linear or non-linear confounders (confounding).
Examples
set.seed(123)
dataset <- generate_cre_dataset(n = 1000, rho = 0, n_rules = 2, p = 10,
                                effect_size = 2, binary_covariates = TRUE,
                                binary_outcome = TRUE, confounding = "no")