R: Two Stage Curvature Identification with Polynomial Basis...

tsci_poly {TSCI}

R Documentation

Two Stage Curvature Identification with Polynomial Basis Expansion

Description

tsci_poly implements Two Stage Curvature Identification (Guo and Buehlmann 2022) with a basis expansion by monomials. In a data-dependent way, it tests for the smallest sufficiently large violation space among a pre-specified sequence of nested violation space candidates. Point and uncertainty estimates of the treatment effect are returned for all violation space candidates, including the selected violation space, along with other relevant statistics.

Usage

tsci_poly(
  Y,
  D,
  Z,
  X = NULL,
  W = X,
  vio_space = NULL,
  create_nested_sequence = TRUE,
  sel_method = c("comparison", "conservative"),
  min_order = 1,
  max_order = 10,
  exact_order = NULL,
  order_selection_method = c("grid search", "backfitting"),
  max_iter = 100,
  conv_tol = 10^-6,
  gcv = FALSE,
  nfolds = 5,
  sd_boot = TRUE,
  iv_threshold = 10,
  threshold_boot = TRUE,
  alpha = 0.05,
  intercept = TRUE,
  B = 300
)

Arguments

`Y`	observations of the outcome variable. Either a numeric vector of length n or a numeric matrix with dimension n by 1. If the outcome variable is binary, use dummy encoding.
`D`	observations of the treatment variable. Either a numeric vector of length n or a numeric matrix with dimension n by 1. If the treatment variable is binary, use dummy encoding.
`Z`	observations of the instrumental variable(s). Either a vector of length n or a matrix with dimension n by s. If observations are not numeric, dummy encoding will be applied.
`X`	observations of baseline covariate(s). Either a vector of length n or a matrix with dimension n by p or `NULL` (if no covariates should be included). If observations are not numeric, dummy encoding will be applied.
`W`	(transformed) observations of baseline covariate(s) used to fit the outcome model. Either a vector of length n or a matrix with dimension n by p_w or `NULL` (if no covariates should be included). If observations are not numeric, dummy encoding will be applied.
`vio_space`	either `NULL` or a list with numeric vectors of length n and/or numeric matrices with n rows as elements to specify the violation space candidates. If observations are not numeric, dummy encoding will be applied. See Details for more information. If `NULL`, the violation space candidates are chosen to be a nested sequence of monomials with degree depending on the orders of the polynomials used to fit the treatment model.
`create_nested_sequence`	logical. If `TRUE`, the violation space candidates (in form of matrices) are defined sequentially, starting with an empty violation matrix and subsequently adding the next element of `vio_space` to the current violation matrix. If `FALSE`, the violation space candidates (in form of matrices) are defined as the empty space and the elements of `vio_space`. See Details for more information.
`sel_method`	The selection method used to estimate the treatment effect. Either "comparison" or "conservative". See Details.
`min_order`	either a single integer value or a vector of integer values of length s specifying the smallest order of polynomials to use in the selection of the treatment model. If a single integer value is provided, the polynomials of all instrumental variables use this value.
`max_order`	either a single integer value or a vector of integer values of length s specifying the largest order of polynomials to use in the selection of the treatment model. If a single integer value is provided, the polynomials of all instrumental variables use this value.
`exact_order`	either a single integer value or a vector of integer values of length s specifying the exact order of polynomials to use in the treatment model. If a single integer value is provided, the polynomials of all instrumental variables use this value.
`order_selection_method`	method used to select the best fitting order of polynomials for the treatment model. Must be either 'grid search' or 'backfitting'. 'grid search' can be very slow if the number of instruments is large.
`max_iter`	number of iterations used in the backfitting algorithm if `order_selection_method` is 'backfitting'. Has to be a positive integer value.
`conv_tol`	tolerance of convergence in the backfitting algorithm if `order_selection_method` is 'backfitting'.
`gcv`	logical. If `TRUE`, the generalized cross-validation mean squared error is used to determine the best fitting order of polynomials for the treatment model. If `FALSE`, k-fold cross-validation is used instead.
`nfolds`	number of folds used for the k-fold cross-validation if `gcv` is `FALSE`. Has to be a positive integer value.
`sd_boot`	logical. If `TRUE`, the standard error is determined using a bootstrap approach.
`iv_threshold`	a numeric value specifying the minimum value of the threshold of the IV strength test.
`threshold_boot`	logical. If `TRUE`, the threshold of the IV strength is determined using a bootstrap approach. If `FALSE`, no bootstrap is performed. See Details.
`alpha`	the significance level. Has to be a numeric value between 0 and 1.
`intercept`	logical. If `TRUE`, an intercept is included in the outcome model.
`B`	number of bootstrap samples. Has to be a positive integer value. Bootstrap methods are used to calculate the IV strength threshold if `threshold_boot` is `TRUE` and for the violation space selection.
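
The arguments `min_order`, `max_order`, `gcv` and `nfolds` control how the order of the polynomial in the treatment model is chosen. The following base R sketch illustrates the two criteria for a single instrument. The helpers `cv_mse` and `gcv_mse` are hypothetical and written for illustration only; they are not part of the TSCI package and need not match its internal implementation.

```r
# Illustration only: hypothetical helpers, not the TSCI implementation.
set.seed(1)
n <- 200
Z <- rnorm(n)
D <- 1 + Z + Z^2 + rnorm(n)  # simulated treatment with a quadratic signal

# k-fold cross-validated MSE of a polynomial fit of the given order
cv_mse <- function(order, D, Z, nfolds = 5) {
  folds <- sample(rep(seq_len(nfolds), length.out = length(D)))
  A <- cbind(1, outer(Z, seq_len(order), `^`))  # intercept + monomials
  errs <- vapply(seq_len(nfolds), function(k) {
    train <- folds != k
    coefs <- lm.fit(A[train, , drop = FALSE], D[train])$coefficients
    mean((D[!train] - A[!train, , drop = FALSE] %*% coefs)^2)
  }, numeric(1))
  mean(errs)
}

# generalized cross-validation score: n * RSS / (n - df)^2
gcv_mse <- function(order, D, Z) {
  A <- cbind(1, outer(Z, seq_len(order), `^`))
  fit <- lm.fit(A, D)
  n_obs <- length(D)
  n_obs * sum(fit$residuals^2) / (n_obs - fit$rank)^2
}

# grid search over the candidate orders min_order = 1, ..., max_order = 5
orders <- 1:5
cv_scores <- vapply(orders, cv_mse, numeric(1), D = D, Z = Z)
gcv_scores <- vapply(orders, gcv_mse, numeric(1), D = D, Z = Z)
best_cv <- orders[which.min(cv_scores)]
best_gcv <- orders[which.min(gcv_scores)]
```

With several instruments, a grid search of this kind evaluates every combination of orders, which is why 'grid search' can become slow and 'backfitting' is offered as an alternative.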

Details

The treatment and outcome models are assumed to be of the following forms:

D_i = f(Z_i, X_i) + \delta_i

Y_i = \beta \cdot D_i + h(Z_i, X_i) + \phi(X_i) + \epsilon_i

where f(Z_i, X_i) is estimated using a polynomial basis expansion of the instrumental variables and a linear combination of the baseline covariates, h(Z_i, X_i) is approximated using the violation space candidates and \phi(X_i) is approximated by a linear combination of the columns in W. The errors are allowed to be heteroscedastic.
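
As a rough illustration of the two model equations, the following base R sketch simulates data from them and fits the treatment model with a monomial basis of the instrument plus a linear term in the covariate. The functional forms of `f`, `h` and `phi`, the coefficient values and the chosen order 3 are assumptions made for this example only.

```r
# Illustration only: all functional forms and constants are assumed.
set.seed(2)
n <- 150
Z <- rnorm(n)                       # instrumental variable
X <- rnorm(n)                       # baseline covariate
delta <- rnorm(n)                   # treatment model error
epsilon <- 0.5 * delta + rnorm(n)   # outcome error, correlated with delta
beta <- 1                           # true treatment effect

f <- function(z, x) 1 + z + z^2 + 0.5 * x  # treatment model
h <- function(z, x) z                      # violation: direct effect of Z on Y
phi <- function(x) 0.5 * x                 # covariate effect on the outcome

D <- f(Z, X) + delta
Y <- beta * D + h(Z, X) + phi(X) + epsilon

# first stage: monomials of Z up to order 3 and a linear term in X
A <- cbind(1, Z, Z^2, Z^3, X)
hat_matrix <- A %*% solve(crossprod(A), t(A))
D_hat <- drop(hat_matrix %*% D)

# the trace of the hat matrix equals the number of basis columns
sum(diag(hat_matrix))
```

The hat matrix of this first-stage fit is the object that enters the IV strength threshold described further down.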

The violation space candidates should form a nested sequence, because the violation space selection compares the treatment effect estimate obtained with each violation space candidate to the estimates obtained with all violation space candidates further down the list vio_space that provide enough IV strength. A violation space candidate is selected only if none of these comparisons shows a significant difference. If sel_method is 'comparison', the treatment effect estimate of this violation space candidate will be returned. If sel_method is 'conservative', the treatment effect estimate of the next violation space candidate in the sequence will be returned, provided that the IV strength is large enough. If vio_space is NULL, the violation space candidates are chosen to be a nested sequence of polynomials of the instrumental variables up to the degrees used to fit the treatment model. This guarantees that the possible spaces of the violation will be tested. If the functional form of the outcome model is not well known, it is advisable to use the default values for W and vio_space.
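
The difference between the two settings of create_nested_sequence can be sketched in base R as follows. The helper `build_candidates` is hypothetical; it only mirrors the description of the argument and is not taken from the package source.

```r
# Illustration only: hypothetical helper, not the TSCI implementation.
set.seed(3)
n <- 50
Z <- rnorm(n)
vio_space <- list(Z, Z^2, Z^3)

build_candidates <- function(vio_space, create_nested_sequence = TRUE) {
  n_obs <- NROW(vio_space[[1]])
  empty <- matrix(numeric(0), nrow = n_obs, ncol = 0)  # empty violation space
  if (create_nested_sequence) {
    # accumulate columns: empty, [Z], [Z, Z^2], [Z, Z^2, Z^3]
    Reduce(cbind, vio_space, init = empty, accumulate = TRUE, simplify = FALSE)
  } else {
    # empty space followed by each element on its own
    c(list(empty), lapply(vio_space, as.matrix))
  }
}

nested <- build_candidates(vio_space, create_nested_sequence = TRUE)
separate <- build_candidates(vio_space, create_nested_sequence = FALSE)

vapply(nested, ncol, integer(1))    # column counts: 0 1 2 3
vapply(separate, ncol, integer(1))  # column counts: 0 1 1 1
```

Only the first variant produces candidates in which each space contains the previous one, which is what the comparison of estimates described above relies on.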

The instrumental variable(s) are considered strong enough for violation space candidate V_q if the IV strength estimated using this violation space candidate is larger than the threshold of the IV strength. The threshold of the IV strength has the form \min \{\max \{ 2 \cdot \mathrm{Trace} [ \mathrm{M} (V_q) ], \mathrm{iv\_threshold} \} + S (V_q), 40 \} if threshold_boot is TRUE, and \min \{\max \{ 2 \cdot \mathrm{Trace} [ \mathrm{M} (V_q) ], \mathrm{iv\_threshold} \}, 40 \} if threshold_boot is FALSE. The matrix \mathrm{M} (V_q) depends on the hat matrix obtained from estimating f(Z_i, X_i), the violation space candidate V_q and the variables W included in the outcome model. S (V_q) is obtained using a bootstrap and aims to adjust for the estimation error of the IV strength. The threshold obtained using the bootstrap approach is usually larger, so setting threshold_boot to TRUE leads to a more conservative IV strength test. For more information see subsection 3.3 in Guo and Buehlmann (2022).
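
The threshold formula can be written as a small R function. This is a minimal sketch: the helper `iv_strength_threshold` is hypothetical, and its arguments `trace_M` and `S` stand for \mathrm{Trace} [ \mathrm{M} (V_q) ] and S (V_q), which the package computes internally and which are simply passed in as numbers here.

```r
# Illustration only: hypothetical helper that evaluates the threshold formula.
iv_strength_threshold <- function(trace_M, iv_threshold = 10, S = 0,
                                  threshold_boot = TRUE) {
  base <- max(2 * trace_M, iv_threshold)
  if (threshold_boot) {
    base <- base + S  # bootstrap adjustment for the estimation error
  }
  min(base, 40)       # the threshold is capped at 40
}

iv_strength_threshold(3, S = 4)                   # max(6, 10) + 4 = 14
iv_strength_threshold(3, threshold_boot = FALSE)  # max(6, 10) = 10
iv_strength_threshold(25, S = 4)                  # min(50 + 4, 40) = 40
```

The second call shows that iv_threshold acts as a lower bound when the trace term is small, and the third shows the cap at 40.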

See also Carl et al. (2023) for more details.

Value

A list containing the following elements:

Coef_all: a series of point estimates of the treatment effect obtained by the different violation space candidates.
sd_all: standard errors of the estimates of the treatment effect obtained by the different violation space candidates.
pval_all: p-values of the treatment effect estimates obtained by the different violation space candidates.
CI_all: confidence intervals for the treatment effect obtained by the different violation space candidates.
Coef_sel: the point estimator of the treatment effect obtained by the selected violation space candidate(s).
sd_sel: the standard error of Coef_sel.
pval_sel: p-value of the treatment effect estimate obtained by the selected violation space candidate(s).
CI_sel: confidence interval for the treatment effect obtained by the selected violation space candidate(s).
iv_str: IV strength using the different violation space candidates.
iv_thol: the threshold for the IV strength using the different violation space candidates.
Qmax: the largest violation space candidate for which the IV strength was considered large enough by the IV strength test. If 0, the IV strength test failed for the first violation space candidate. Otherwise, violation space selection was performed.
q_comp: the violation space candidate that was selected by the comparison method over the multiple data splits.
q_cons: the violation space candidate that was selected by the conservative method over the multiple data splits.
invalidity: shows whether the instrumental variable(s) were considered valid, invalid or too weak to test for violations. The instrumental variables are considered too weak to test for violations if the IV strength is already too weak using the first violation space candidate (besides the empty violation space). Testing for violations is always performed by using the comparison method.
mse: the out-of-sample mean squared error of the treatment model.

References

Zijian Guo and Peter Buehlmann. Two Stage Curvature Identification with Machine Learning: Causal Inference with Possibly Invalid Instrumental Variables. arXiv:2203.12808, 2022.
David Carl, Corinne Emmenegger, Peter Buehlmann, and Zijian Guo. TSCI: two stage curvature identification for causal inference with invalid instruments. arXiv:2304.00513, 2023.

Examples

### a small example without baseline covariates
if (require("MASS")) {
  # sample size
  n <- 100
  # the IV strength
  a <- 1
  # the violation strength
  tau <- 1
  # true effect
  beta <- 1
  # treatment model
  f <- function(x) {1 + a * (x + x^2)}
  # outcome model
  g <- function(x) {1 + tau * x}

  # generate data
  mu_error <- rep(0, 2)
  Cov_error <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
  Error <- MASS::mvrnorm(n, mu_error, Cov_error)
  # instrumental variable
  Z <- rnorm(n)
  # treatment variable
  D <- f(Z) + Error[, 1]
  # outcome variable
  Y <- beta * D + g(Z) + Error[, 2]

  # Two Stage Polynomials
  output_PO <- tsci_poly(Y, D, Z, max_order = 3, max_iter = 20, B = 100)
  summary(output_PO)
}

[Package TSCI version 3.0.4 Index]