R: Non-linear Invariant Causal Prediction

seqICPnl {seqICP}

R Documentation

Non-linear Invariant Causal Prediction

Description

Estimates the causal parents S of the target variable Y using invariant causal prediction and fits a general model of the form
Y = f(X^S) + N.

Usage

seqICPnl(X, Y, test = "block.variance", par.test = list(grid = c(0,
  round(nrow(X)/2), nrow(X)), complements = FALSE, link = sum, alpha = 0.05, B =
  100), regression.fun = function(X, Y) fitted.values(lm.fit(X, Y)),
  max.parents = ncol(X), stopIfEmpty = TRUE, silent = TRUE)

Arguments

`X`	matrix of predictor variables. Each column corresponds to one predictor variable.
`Y`	vector of target variable, with length(Y)=nrow(X).
`test`	string specifying the hypothesis test used to test for invariance of a parent set S (i.e. the null hypothesis H0_S). The following tests are available: "block.mean", "block.variance", "block.decoupled", "smooth.mean", "smooth.variance", "smooth.decoupled" and "hsic".
`par.test`	parameters specifying the hypothesis test. The following parameters are available: `grid`, `complements`, `link`, `alpha` and `B`. The parameter `grid` is an increasing vector of gridpoints used to construct environments for change point based tests. If the parameter `complements` is 'TRUE', each environment is compared against its complement; if it is 'FALSE', all environments are compared pairwise. The parameter `link` specifies how to combine the pairwise test statistics; generally this is either max or sum. The parameter `alpha` is a numeric value in (0,1) indicating the significance level of the hypothesis test. The parameter `B` is an integer and specifies the number of Monte-Carlo samples used in the approximation of the null distribution.
`regression.fun`	regression function used to fit the function f. This should be a function which takes the arguments (X,Y) and outputs the predicted values f(X).
`max.parents`	integer specifying the maximum size of admissible parent sets. Reducing this below the number of predictor variables saves computational time but means that the confidence intervals lose their coverage property.
`stopIfEmpty`	if 'TRUE', the procedure will stop computing confidence intervals if the empty set has been accepted (and hence no variable can have a significant causal effect). Setting to 'TRUE' will save computational time in these cases, but means that the confidence intervals lose their coverage properties for values different to 0.
`silent`	if 'FALSE', the procedure will output progress notifications consisting of the currently computed set S together with the p-value resulting from the null hypothesis H0_S.
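
As a rough illustration of the `grid` and `complements` parameters, the following base R sketch splits the row indices into consecutive blocks and counts the comparisons each setting would need. The helper `make_blocks` is hypothetical, and the convention that gridpoints g_0 < ... < g_K yield the blocks (g_{k-1}+1):g_k is an assumption, not taken from the package source.

```r
# Assumed convention: consecutive gridpoints delimit one environment each.
n <- 200
grid <- c(0, round(n / 2), n)        # default grid from the Usage section

# Hypothetical helper: turn gridpoints into a list of index blocks
make_blocks <- function(grid) {
  lapply(seq_len(length(grid) - 1),
         function(k) (grid[k] + 1):grid[k + 1])
}
blocks <- make_blocks(grid)          # two blocks of 100 rows each

# A finer grid, as used in the Examples section: ten blocks of equal size
fine_blocks <- make_blocks(seq(0, n, n / 10))
K <- length(fine_blocks)

n_pairwise   <- choose(K, 2)         # complements = FALSE: all pairs of blocks
n_complement <- K                    # complements = TRUE: block vs. the rest
```

The statistics from these comparisons are then combined by the function given as `link`, for example `sum` or `max`.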

Details

The function can be applied to models of the form
Y_i = f(X_i^S) + N_i
with i.i.d. noise N_i, where f belongs to a function class that the regression procedure supplied via the parameter regression.fun is able to approximate.
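
A minimal sketch of a custom regression function for a quadratic function class is given below. The name `poly_fit` is hypothetical. The sketch assumes, as the GAM function in the Examples section suggests, that the first column of X is an intercept column; this is an inference from the examples rather than a documented guarantee.

```r
# Hypothetical regression function: least squares on linear and squared terms.
# Assumption: the first column of X is the intercept column.
poly_fit <- function(X, Y) {
  X <- as.matrix(X)
  if (ncol(X) == 1) {
    # no predictors: predict with the mean of Y
    return(rep(mean(Y), nrow(X)))
  }
  Z <- X[, -1, drop = FALSE]
  design <- cbind(1, Z, Z^2)
  fitted.values(lm.fit(design, Y))
}

# Small check on simulated data with a quadratic signal
set.seed(1)
x <- rnorm(50)
Y <- 2 * x^2 + 0.1 * rnorm(50)
fit <- poly_fit(cbind(1, x), Y)
mse <- mean((Y - fit)^2)
```

Such a function could be passed as `regression.fun = poly_fit`; a richer smoother is preferable when f is not well described by a quadratic.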

The invariant prediction procedure is applied using the hypothesis test specified by the test parameter to determine whether a candidate model is invariant. For further details see the references.
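
The idea of invariance can be illustrated with a simplified toy comparison of residual variances across two environments. This is only a sketch of the intuition; it is not one of the package's test statistics, which are described in the references. All variable names are illustrative.

```r
set.seed(1)
n <- 100
# Two environments: the distribution of X shifts, the assignment for Y does not
X <- c(0.3 * rnorm(n), 2 * rnorm(n))
Y <- 2 * X + 0.1 * rnorm(2 * n)
env <- rep(1:2, each = n)

# Residuals for the candidate set S = {X} and for the empty set
res_parent <- residuals(lm(Y ~ X))
res_empty  <- Y - mean(Y)

# Ratio of residual variances between the two environments
var_ratio <- function(r) var(r[env == 2]) / var(r[env == 1])
ratio_parent <- var_ratio(res_parent)  # residual distribution looks invariant
ratio_empty  <- var_ratio(res_empty)   # residual distribution clearly changes
```

For the true parent set the residuals behave alike in both environments, whereas for the empty set their variance differs strongly, so a test for invariance would be expected to reject the empty set here.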

Value

object of class 'seqICPnl' consisting of the following elements

`parent.set`	vector of the estimated causal parents.
`test.results`	matrix containing results from each individual hypothesis test H0_S as rows. The first column ind links the rows in the `test.results` matrix to the position in the list of the variable `S`.
`S`	list of all the sets that were tested. The position within the list corresponds to the index in the first column of the test.results matrix.
`p.values`	p-value for each variable under the hypothesis that it is not included in the set of true causal parents. (If a p-value is smaller than alpha, the corresponding variable is a member of parent.set.)
`stopIfEmpty`	a boolean value indicating whether computations stop as soon as the intersection of accepted sets is empty.
`modelReject`	a boolean value indicating if the whole model was rejected (the p-value of the best fitting model is too low).
`alpha`	significance level at which the hypothesis tests were performed.
`n.var`	number of predictor variables.
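
The relation between `p.values`, `alpha` and `parent.set` can be sketched as follows. The list `res` is a hand-built stand-in with made-up numbers, not output of seqICPnl; only the element names are taken from the description above.

```r
# Hand-built stand-in for a result object (illustrative values only)
res <- list(
  parent.set = c(1, 3),
  p.values   = c(0.01, 0.40, 0.02),
  alpha      = 0.05,
  n.var      = 3
)

# Variables whose p-value falls below alpha form the estimated parent set
selected <- which(res$p.values < res$alpha)
```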

Author(s)

Niklas Pfister and Jonas Peters

References

Pfister, N., P. Bühlmann and J. Peters (2017). Invariant Causal Prediction for Sequential Data. ArXiv e-prints (1706.08058).

Peters, J., P. Bühlmann, and N. Meinshausen (2016). Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (with discussion) 78 (5), 947–1012.

Examples

set.seed(2)

# environment 1
na <- 120
X1a <- 0.3*rnorm(na)
X3a <- X1a + 0.2*rnorm(na)
Ya <- 2*X1a^2 + 0.6*sin(X3a) + 0.1*rnorm(na)
X2a <- -0.5*Ya + 0.5*X3a + 0.1*rnorm(na)

# environment 2
nb <- 80
X1b <- 2*rnorm(nb)
X3b <- rnorm(nb)
Yb <- 2*X1b^2 + 0.6*sin(X3b) + 0.1*rnorm(nb)
X2b <- -0.5*Yb + 0.8*rnorm(nb)

# combine environments
X1 <- c(X1a,X1b)
X2 <- c(X2a,X2b)
X3 <- c(X3a,X3b)
Y <- c(Ya,Yb)
Xmatrix <- cbind(X1, X2, X3)

# use GAM as regression function
GAM <- function(X,Y){
  d <- ncol(X)
  if(d>1){
    # build the formula Y ~ 1 + s(X1) + ... + s(X(d-1))
    formula <- "Y~1"
    names <- c("Y")
    for(i in 1:(d-1)){
      formula <- paste(formula,"+s(X",toString(i),")",sep="")
      names <- c(names,paste("X",toString(i),sep=""))
    }
    # drop the first column of X; the remaining columns enter as smooth terms
    data <- data.frame(cbind(Y,X[,-1,drop=FALSE]))
    colnames(data) <- names
    fit <- fitted.values(mgcv::gam(as.formula(formula),data=data))
  } else{
    # no smooth terms to fit: predict with the mean of Y
    fit <- rep(mean(Y),nrow(X))
  }
  return(fit)
}

# Y follows the same structural assignment in both environments
# a and b (cf. the lines Ya <- ... and Yb <- ...).
# The direct causes of Y are X1 and X3.
# A GAM model fit considers X1, X2 and X3 as significant.
# All these variables are helpful for the prediction of Y.
summary(mgcv::gam(Y~s(X1)+s(X2)+s(X3)))

# apply seqICP to the same setting
seqICPnl.result <- seqICPnl(X = Xmatrix, Y, test = "block.variance",
  par.test = list(grid = seq(0, na + nb, (na + nb)/10), complements = FALSE,
    link = sum, alpha = 0.05, B = 100),
  regression.fun = GAM, max.parents = 4, stopIfEmpty = FALSE, silent = FALSE)
summary(seqICPnl.result)
# seqICPnl is able to infer that X1 and X3 are causes of Y