ICP {InvariantCausalPrediction} | R Documentation |
Invariant Causal Prediction
Description
Confidence intervals for causal effects in a regression setting.
Usage
ICP(X, Y, ExpInd, alpha=0.01, test = "normal",
selection = c("lasso", "all", "stability", "boosting")
[if (ncol(X) <=8) 2 else 4],
maxNoVariables = 8, maxNoVariablesSimult = 8,
maxNoObs = 200,
showAcceptedSets = TRUE,
showCompletion = TRUE,
stopIfEmpty = FALSE,
gof = max(0.01,alpha))
Arguments
X |
A matrix (or data frame) with the predictor variables for all experimental settings |
Y |
The response or target variable of interest. Can be numeric for regression or a factor with two levels for binary classification. |
ExpInd |
Indicator of the experiment or the intervention type an observation belongs to.
Can be a numeric vector of the same length as |
alpha |
Determines coverage of the confidence regions. |
test |
Use "exact" for an exact test in a regression setting, especially if sample size is small. However, this test is computationally demanding if sample size is high. The default "normal" tests for a shift in mean and variance of the residuals between different populations/environments. Using "correlation" tests additionally for vanishing correlation between predictor variables and residuals in each environment. Other options are "ranks" that uses a rank-based alternative and "ks" for a Kolmogorov-Smirnov test to detect differences in the distributions of the residuals between different environments. It is also possible to supply a function of the form |
selection |
The method for pre-selection of variables to save computational resources. Can use "all" for no pre-selection (which guarantees coverage but might take longer to compute), "boosting" for a boosting-type, "lasso" for Lasso cross-validated or "stability" for a stability-selection-type pre-selection. Default is "all" if p does not exceed 10 and "boosting" otherwise. |
maxNoVariables |
The maximal number of variables to pre-select (choosing smaller values saves computational resources but increases approximation error). |
maxNoVariablesSimult |
The maximal size of sets of variables considered in the procedure
(same comment as for |
maxNoObs |
The maximal number of observations used for the "exact" test
(same comment as for |
showAcceptedSets |
If |
showCompletion |
If |
stopIfEmpty |
If |
gof |
If no set of variables (including the empty set) leads to a p-value larger than the goodness-of-fit cutoff |
Value
A list with elements
ConfInt |
The matrix with confidence intervals for the causal coefficient of all variables. First row is the upper bound and second row the lower bound. |
maximinCoefficients |
The value in the confidence interval closest to 0. Is hence non-zero for variables with significant effects. |
colnames |
The names of the variables (replaced with generic "Variable 1" etc. if not available). |
factor |
Logical indicating whether the response is a factor or not. |
dimX |
The dimensions of the matrix with predictor variables. |
Coeff |
A list which contains for all variables the vector with point-estimates among all accepted sets where the variable was part of the set. |
CoeffVar |
Same as |
modelReject |
Logical indicating if the whole model was rejected (the p-value of the best fitting model is too low). |
acceptedSets |
A list with one element per accepted set. For each accepted model the list entry is a vector that contains the indices of the variables in the accepted set. |
usedvariables |
The pre-selected variables if not all variables are used for the analysis; otherwise all variables. |
pvalues |
The p-values of all variables. |
stopIfEmpty |
A boolean value indicating whether computations stop as soon as intersection of accepted sets is empty. |
noEnv |
The number of distinct environments. |
gof |
The goodness-of-fit cutoff |
bestModel |
The largest p-value across all tested sets of variables. |
Author(s)
Nicolai Meinshausen <meinshausen@stat.math.ethz.ch>
References
Jonas Peters, Peter Buhlmann, Nicolai Meinshausen (2015):
Causal inference using invariant prediction: identification and confidence intervals
arxiv preprint http://arxiv.org/abs/1501.01332
See Also
hiddenICP
for reconstructing the parents of a variable in the presence of hidden variables (but assuming shift/additive interventions), which also allows construction of confidence intervals for the causal coefficients. See package "backShift" for constructing point estimates of causal cyclic models in the presence of hidden variables (again under shift interventions).
Examples
##########################################
####### 1st example:
####### Simulate data with interventions
set.seed(1)
## sample size n
n <- 4000
## 5 predictor variables
p <- 5
## simulate as independent Gaussian variables
X <- matrix(rnorm(n*p),nrow=n)
## divide data into observational (ExpInd=1) and interventional (ExpInd=2)
ExpInd <- c(rep(1,n/2),rep(2,n/2))
## for interventional data (ExpInd==2): change distribution
X[ExpInd==2,] <- sweep(X[ExpInd==2,],2, 5*rnorm(p) ,FUN="*")
## first two variables are the causal predictors of Y
beta <- c(1,1,rep(0,p-2))
## response variable Y
Y <- as.numeric(X%*%beta + rnorm(n))
## optinal: make last variable a child of Y (so last variable is non-causal for Y)
X[,p] <- 0.3*Y + rnorm(n)
####### Compute "Invariant Causal Prediction" Confidence Intervals
icp <- ICP(X,Y,ExpInd)
###### Print/plot/show summary of output
print(icp)
plot(icp)
#### compare with linear model
cat("\n compare with linear model \n")
print(summary(lm(Y~X)))
##########################################
####### 2nd example:
####### Simulate a DAG where X1 -> Y, Y -> X2 and Y -> X3
####### noise interventions on second half of data on X1
####### structure of DAG (at Y -> X2) is changing under interventions
n1 <- 400
n2 <- 500
ExpInd <- c(rep(1,n1), rep(2,n2))
## index for observational (ExpInd=1) and intervention data (ExpInd=2)
X1 <- c(rnorm(n1), 2 * rnorm(n2) + 1)
Y <- 0.5 * X1 + 0.2 * rnorm(n1 + n2)
X2 <- c(1.5 * Y[1:n1] + 0.4 * rnorm(n1),
-0.3 * Y[(n1+1):n2] + 0.4 * rnorm(n2))
X3 <- -0.4 * Y + 0.2 * rnorm(n1 + n2)
X <- cbind(X1, X2, X3)
### Compute "Invariant Causal Prediction" Confidence Intervals
## use a rank-based test to detect shift in distribution of residuals
icp <- ICP(X, Y, ExpInd,test="ranks")
## use a Kolmogorov-Smirnov test to detect shift in distribution of residuals
icp <- ICP(X, Y, ExpInd,test="ks")
## can also supply test as a function
## here chosen to be equivalent to option "ks" above
icp <- ICP(X, Y, ExpInd,test=function(x,z) ks.test(x,z)$p.val)
## use a test based on normal approximations
icp <- ICP(X, Y, ExpInd, test="normal")
### Print/plot/show summary of output
print(icp)
plot(icp)
#### compare with linear model
cat("\n compare with linear model \n")
print(summary(lm(Y~X)))
## Not run:
##########################################
####### 3rd example:
####### College Distance data
library(AER)
data("CollegeDistance")
CD <- CollegeDistance
## define two experimental settings by
## distance to closest 4-year college
ExpInd <- list()
ExpInd[[1]] <- which(CD$distance < quantile(CD$distance,0.5))
ExpInd[[2]] <- which(CD$distance >= quantile(CD$distance,0.5))
## target variable is binary (did education lead at least to BA degree?)
Y <- as.factor(CD$education>=16)
## use these predictors
X <- CD[,c("gender","ethnicity","score","fcollege","mcollege","home",
"urban","unemp","wage","tuition","income","region")]
## searching all subsets (use selection="lasso" or selection="stability"
## to select a subset of subsets to search)
## with selection="all" the function will take several minutes
icp <- ICP(X,Y,ExpInd,selection="all",alpha=0.1)
## Print/plot/show summary of output
print(icp)
summary(icp)
plot(icp)
## End(Not run)