init.causalForest {htetree}                R Documentation

Causal Effect Regression and Estimation Forests (Tree Ensembles)

Description

Build a random causal forest by fitting a user-selected number of causalTree models, yielding an ensemble of rpart objects.

Usage

init.causalForest(
  formula,
  data,
  treatment,
  weights = FALSE,
  cost = FALSE,
  num.trees,
  ncov_sample
)

## S3 method for class 'causalForest'
predict(object, newdata, predict.all = FALSE, type = "vector", ...)

causalForest(
  formula,
  data,
  treatment,
  na.action = na.causalTree,
  split.Rule = "CT",
  double.Sample = TRUE,
  split.Honest = TRUE,
  split.Bucket = FALSE,
  bucketNum = 5,
  bucketMax = 100,
  cv.option = "CT",
  cv.Honest = TRUE,
  minsize = 2L,
  propensity,
  control,
  split.alpha = 0.5,
  cv.alpha = 0.5,
  sample.size.total = floor(nrow(data)/10),
  sample.size.train.frac = 0.5,
  mtry = ceiling(ncol(data)/3),
  nodesize = 1,
  num.trees = nrow(data),
  cost = FALSE,
  weights = FALSE,
  ncolx,
  ncov_sample
)

Arguments

formula

a formula, with a response and features but no interaction terms. If this argument is a data frame, it is taken as the model frame (see model.frame).

data

an optional data frame that includes the variables named in the formula.

treatment

a vector indicating the treatment status of each observation: 1 represents treated and 0 represents control. Only binary treatments are supported in this version.

weights

optional case weights.

cost

a vector of non-negative costs, one for each variable in the model. Defaults to one for all variables. These are scalings to be applied when considering splits, so the improvement on splitting on a variable is divided by its cost in deciding which split to choose.

num.trees

Number of trees to be built in the causal forest

ncov_sample

Number of covariates randomly sampled to build each tree in the forest

object

a causalTree object

newdata

new data to predict

predict.all

If TRUE, return the predicted effect from each tree in the forest for each observation. Otherwise, return the average effect across trees.

type

the type of returned object

...

arguments to rpart.control may also be specified in the call to causalForest. They are checked against the list of valid arguments. The parameter minsize is implemented differently in causalTree than in rpart; we require a minimum of minsize treated observations and a minimum of minsize control observations in each leaf.

na.action

the default action deletes all observations for which y is missing, but keeps those in which one or more predictors are missing.

split.Rule

causalTree splitting options, one of "TOT", "CT", "fit", or "tstats", the four splitting rules in causalTree. Note that the "tstats" alternative does not have an associated cross-validation method cv.option; see Athey and Imbens (2016) for a discussion. Note further that split.Rule and cv.option can be mixed and matched.

double.Sample

boolean option, TRUE or FALSE; if set to TRUE, causalForest will build honest trees.

split.Honest

boolean option, TRUE or FALSE, used to decide the splitting rule of the trees.

split.Bucket

boolean option, TRUE or FALSE, used to specify whether to apply the discrete method when splitting the tree. If set to TRUE, then when splitting a node, the observations in a leaf are first ordered and then partitioned into buckets, with each bucket containing bucketNum treated and bucketNum control units. Splitting then takes place by bucket.

bucketNum

number of observations in each bucket when split.Bucket = TRUE. However, the code will override this choice in order to guarantee that there are at least minsize and at most bucketMax buckets.

bucketMax

the maximum number of buckets to use in splitting when split.Bucket = TRUE; bucketNum may be adjusted to respect the choice of bucketMax.

cv.option

cross-validation options, one of "TOT", "matching", "CT", or "fit", the four cross-validation methods in causalTree. There is no cv.option for the split.Rule "tstats"; see Athey and Imbens (2016) for discussion.

cv.Honest

boolean option, TRUE or FALSE, only used when cv.option is "CT" or "fit", to specify whether to apply the honest risk evaluation function in cross-validation. If set to TRUE, the honest risk function is used; otherwise the adaptive risk function is used. If set to FALSE, the user's choice of cv.alpha will be set to 1. If set to TRUE, cv.alpha defaults to 0.5, but the user's choice of cv.alpha will be respected. Note that honest cross-validation estimates within-leaf variances and may perform better with larger leaf sizes and/or a small number of cross-validation sets.

minsize

in order to split, each leaf must have at least minsize treated cases and minsize control cases. The default value is set as 2.

propensity

propensity score used in "TOT" splitting and in the "TOT" and honest "CT" cross-validation methods. The default value is the proportion of treated cases among all observations (see the sketch following this argument list). In this implementation the propensity score is a single constant for the whole dataset; unit-specific propensity scores are not supported, but the user may supply inverse propensity scores as case weights if desired.

control

a list of options that control details of the rpart algorithm. See rpart.control.

split.alpha

scale parameter between 0 and 1, used in the splitting risk evaluation function for "CT". When split.Honest = FALSE, split.alpha is set to 1. For split.Rule = "tstats" with split.Honest = TRUE, split.alpha is used in calculating the risk function, which determines the order of pruning in cross-validation.

cv.alpha

scale parameter between 0 and 1, used in the cross-validation risk evaluation function for "CT" and "fit". When cv.Honest = FALSE, cv.alpha is set to 1.

sample.size.total

Sample size used to build each tree in the forest (sampled randomly with replacement).

sample.size.train.frac

Fraction of sample.size.total used for building each tree (training). For example, if sample.size.total is 1000 and the fraction is 0.5, then 500 samples will be used to build the tree and the other 500 samples will be used to evaluate it (see the sketch following this argument list).

mtry

Number of data features used to build a tree (this argument is not currently used).

nodesize

Minimum number of observations for treated and control cases in one leaf node

ncolx

Total number of covariates
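
The interplay between the sampling-size and propensity defaults above can be seen in a minimal sketch. This illustrates only the documented defaults, not the package internals; the exact rounding used inside causalForest may differ.

  # 'treatment' is assumed to be a 0/1 vector as described above (toy data here).
  treatment <- rbinom(100, 1, 0.4)
  prop_default <- mean(treatment)   # default propensity: the share of treated cases

  # Train/estimation split implied by sample.size.total and sample.size.train.frac,
  # e.g. 1000 sampled observations with a training fraction of 0.5:
  sample.size.total      <- 1000
  sample.size.train.frac <- 0.5
  n_train    <- floor(sample.size.total * sample.size.train.frac)  # 500 used to build each tree
  n_estimate <- sample.size.total - n_train                        # 500 used to evaluate it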

Details

causalForest builds an ensemble of causal trees (see Athey and Imbens, Recursive Partitioning for Heterogeneous Causal Effects (2016)) by repeatedly sampling the data at random with replacement. Further, each tree is built using a randomly sampled subset of all available covariates. A causal forest object is a list of trees. To predict, call R's predict function with new test data and the causalForest object (estimated on the training data) obtained after calling the causalForest function. During the prediction phase, the average value over all tree predictions is returned as the final prediction by default. To return the prediction of each tree in the forest for each test observation, set the flag predict.all = TRUE.

causalTree differs from the rpart function in the rpart package in its splitting rules and cross-validation methods. See Athey and Imbens, Recursive Partitioning for Heterogeneous Causal Effects (2016), and Wager and Athey, Estimation and Inference of Heterogeneous Treatment Effects using Random Forests (2015), for more details.
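
A minimal sketch of the two prediction modes, assuming cf is a fitted causalForest (as in the Examples below) and testdata is a data frame of new observations; the shape of the predict.all = TRUE return value is an assumption made here for illustration only:

  avg_pred <- predict(cf, newdata = testdata)                      # averaged over all trees
  all_pred <- predict(cf, newdata = testdata, predict.all = TRUE)  # per-tree predictions
  # If all_pred is arranged with one column per tree (an assumption about the
  # return shape), averaging across trees recovers the default prediction:
  # rowMeans(as.matrix(all_pred))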

Value

An object of class rpart. See rpart.object.

References

Breiman L., Friedman J. H., Olshen R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth.

Athey, S. and Imbens, G. (2016) Recursive Partitioning for Heterogeneous Causal Effects. http://arxiv.org/abs/1504.01132

Wager, S. and Athey, S. (2015) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. http://arxiv.org/abs/1510.04342

See Also

causalTree, honest.causalTree, rpart.control, rpart.object, summary.rpart, rpart.plot

Examples

library(rpart)
library("htetree")
cf <- causalForest(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, data=simulation.1,
  treatment=simulation.1$treatment,
  split.Rule="CT", split.Honest=TRUE,
  split.Bucket=FALSE, bucketNum = 5,
  bucketMax = 100, cv.option="CT", cv.Honest=TRUE, minsize = 2L,
  split.alpha = 0.5, cv.alpha = 0.5,
  sample.size.total = floor(nrow(simulation.1) / 2),
  sample.size.train.frac = .5,
  mtry = ceiling(ncol(simulation.1)/3), nodesize = 3, num.trees = 5,
  ncolx = 10, ncov_sample = 3)

cfpredtest <- predict.causalForest(cf, newdata=simulation.1[1:100,],
  type="vector")
