
scdata {scpi}

R Documentation

Data Preparation for `scest` or `scpi` for Point Estimation and Inference Procedures Using Synthetic Control Methods.

Description

The command prepares the data to be used by scest or scpi to implement estimation and inference procedures for Synthetic Control (SC) methods. It allows the user to specify the outcome variable, the features of the treated unit to be matched, and covariate-adjustment feature by feature. The names of the output matrices follow the terminology proposed in Cattaneo, Feng, and Titiunik (2021).

Companion Stata and Python packages are described in Cattaneo, Feng, Palomba, and Titiunik (2022).

Companion commands are: scdataMulti for data preparation in the multiple treated units case with staggered adoption, scest for point estimation, scpi for inference procedures, scplot and scplotMulti for plots in the single and multiple treated unit(s) cases, respectively.

Related Stata, R, and Python packages useful for inference in SC designs are described in the following website:

https://nppackages.github.io/scpi/

For an introduction to synthetic control methods, see Abadie (2021) and references therein.

Usage

scdata(
  df,
  id.var,
  time.var,
  outcome.var,
  period.pre,
  period.post,
  unit.tr,
  unit.co,
  features = NULL,
  cov.adj = NULL,
  cointegrated.data = FALSE,
  anticipation = 0,
  constant = FALSE,
  verbose = TRUE
)

Arguments

`df`	a dataframe object.
`id.var`	a character or numeric scalar with the name of the variable containing units' IDs. The ID variable can be numeric or character.
`time.var`	a character with the name of the time variable. The time variable has to be numeric, integer, or Date. If `time.var` is a Date, it should be the output of the `as.Date()` function. An integer or numeric time variable is suggested when working with yearly data, whereas for all other formats a Date type time variable is preferred.
`outcome.var`	a character with the name of the outcome variable. The outcome variable has to be numeric.
`period.pre`	a numeric vector that identifies the pre-treatment period in time.var.
`period.post`	a numeric vector that identifies the post-treatment period in time.var.
`unit.tr`	a character or numeric scalar that identifies the treated unit in `id.var`.
`unit.co`	a character or numeric vector that identifies the donor pool in `id.var`.
`features`	a character vector containing the name of the feature variables used for estimation. If this option is not specified the default is `features = outcome.var`.
`cov.adj`	a list specifying the names of the covariates to be used for adjustment for each feature. If `outcome.var` is not in the variables specified in `features`, we force `cov.adj<-NULL`. See the Details section for more.
`cointegrated.data`	a logical that indicates if there is a belief that the data is cointegrated or not. The default value is `FALSE`. See the Details section for more.
`anticipation`	a scalar that indicates the number of periods of potential anticipation effects. Default is 0.
`constant`	a logical which controls the inclusion of a constant term across features. The default value is `FALSE`.
`verbose`	if `TRUE` prints additional information in the console.
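
As a minimal sketch of the `time.var` requirement above, the snippet below converts a character time column to Date with `as.Date()`. The toy data frame `panel` and its column names (`unit`, `quarter`, `y`) are hypothetical and are not part of scpi.

```r
# Hypothetical quarterly panel: two units observed over three quarters.
panel <- data.frame(
  unit    = rep(c("A", "B"), each = 3),
  quarter = rep(c("2020-01-01", "2020-04-01", "2020-07-01"), times = 2),
  y       = c(1.0, 1.2, 1.1, 0.9, 1.0, 1.3),
  stringsAsFactors = FALSE
)

# Convert the character column to Date, as the argument description asks.
panel$quarter <- as.Date(panel$quarter)

# Confirm the type before handing the column name to `time.var`.
stopifnot(inherits(panel$quarter, "Date"))

# A regular grid of quarters built with base R, for comparison.
quarters_seq <- seq(as.Date("2020-01-01"), by = "quarter", length.out = 3)
```

For yearly data, an integer column such as `1960:2003` is the simpler choice, in line with the recommendation above.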

Details

cov.adj can be used in two ways. First, if only one feature is specified through the option features, cov.adj has to be a list with one (even unnamed) element (e.g., cov.adj = list(c("constant","trend"))). Alternatively, if multiple features are specified, then the user has two possibilities:
- provide a list with one element, then the same covariates are used for adjustment for each feature. For example, if there are two features specified and the user inputs cov.adj = list(c("constant","trend")), then a constant term and a linear trend are used for adjustment for both features.
- provide a list with as many elements as the number of features specified, then feature-specific covariate adjustment is implemented. For example, cov.adj = list('f1' = c("constant","trend"), 'f2' = c("trend")). In this case the name of each element of the list should be one (and only one) of the features specified. Note that if two (or more) features are specified and covariate adjustment has to be specified for just one of them, the user must still provide a list whose length equals the number of features, e.g., cov.adj = list('f1' = c("constant","trend"), 'f2' = NULL).
This option allows the user to include feature-specific constant terms or time trends by simply including "constant" or "trend" in the corresponding element of the list.
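
The three shapes of cov.adj described above can be sketched as follows. This is an illustration under assumptions: the feature `trade` is assumed to be a column of `scpi_germany`, and `check_cov_adj` is a hypothetical helper written for this example, not a function of scpi. The call to `scdata` runs only if the package is installed.

```r
# Features used for illustration; "trade" is an assumed column name.
features <- c("gdp", "trade")

# (a) One element: the same covariates adjust every feature.
cov_common <- list(c("constant", "trend"))

# (b) One named element per feature: feature-specific adjustment.
cov_specific <- list(gdp = c("constant", "trend"), trade = c("trend"))

# (c) Adjustment for one feature only: list() keeps the NULL entry,
#     so the list still has one element per feature.
cov_partial <- list(gdp = c("constant", "trend"), trade = NULL)

# Hypothetical helper (not part of scpi): TRUE if cov.adj has one element,
# or one element per feature with names matching the features.
check_cov_adj <- function(cov.adj, features) {
  if (length(cov.adj) == 1L) return(TRUE)
  length(cov.adj) == length(features) && setequal(names(cov.adj), features)
}

stopifnot(check_cov_adj(cov_common, features),
          check_cov_adj(cov_specific, features),
          check_cov_adj(cov_partial, features))

# Pass one of the lists to scdata when scpi is available.
if (requireNamespace("scpi", quietly = TRUE)) {
  data <- scpi::scpi_germany
  prep <- scpi::scdata(df = data, id.var = "country", time.var = "year",
                       outcome.var = "gdp", period.pre = 1960:1990,
                       period.post = 1991:2003, unit.tr = "West Germany",
                       unit.co = setdiff(unique(data$country), "West Germany"),
                       features = features, cov.adj = cov_specific,
                       verbose = FALSE)
}
```

Note that building case (c) with `list(..., trade = NULL)` keeps the element, whereas assigning `x$trade <- NULL` to an existing list removes it and shortens the list.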

When outcome.var is not included in features, we automatically set \mathcal{R}=\emptyset, that is, we do not perform covariate adjustment. This is because, in this setting, it is natural to create the out-of-sample prediction matrix \mathbf{P} using only the post-treatment outcomes of the donor units.
cointegrated.data allows the user to model the belief that \mathbf{A} and \mathbf{B} form a cointegrated system. In practice, this implies that when dealing with the pseudo-true residuals \mathbf{u}, the first differences of \mathbf{B} are used rather than the levels.
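
To make "first differences rather than levels" concrete, here is a small base R sketch on a toy matrix. The matrix `B_toy` and its values are made up for illustration, and the snippet only shows what first-differencing does to the columns of a matrix; it is not the package's internal code.

```r
# Toy stand-in for B: rows are pre-treatment periods, columns are donors.
B_toy <- cbind(donor1 = c(1, 3, 6, 10),
               donor2 = c(2, 2, 5, 9))

# diff() on a matrix differences each column separately,
# so one period (row) is lost.
dB_toy <- diff(B_toy)

print(B_toy)   # levels
print(dB_toy)  # period-to-period changes
```

With trending series, the levels of different donors tend to move together, while their first differences isolate the period-to-period changes.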

Value

The command returns an object of class 'scdata' containing the following elements:

`A`	a matrix containing pre-treatment features of the treated unit.
`B`	a matrix containing pre-treatment features of the control units.
`C`	a matrix containing covariates for adjustment.
`P`	a matrix whose rows are the vectors used to predict the out-of-sample series for the synthetic unit.
`Y.pre`	a matrix containing the pre-treatment outcome of the treated unit.
`Y.post`	a matrix containing the post-treatment outcome of the treated unit.
`Y.donors`	a matrix containing the pre-treatment outcome of the control units.
`specs`	a list containing some specifics of the data:
- `J`, the number of control units
- `K`, a numeric vector with the number of covariates used for adjustment for each feature
- `KM`, the total number of covariates used for adjustment
- `M`, number of features
- `period.pre`, a numeric vector with the pre-treatment period
- `period.post`, a numeric vector with the post-treatment period
- `T0.features`, a numeric vector with the number of periods used in estimation for each feature
- `T1.outcome`, the number of post-treatment periods
- `outcome.var`, a character with the name of the outcome variable
- `features`, a character vector with the name of the features
- `constant`, for internal use only
- `out.in.features`, for internal use only
- `effect`, for internal use only
- `sparse.matrices`, for internal use only
- `treated.units`, list containing the IDs of all treated units
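
The returned elements can be inspected as in the sketch below, which reuses the call from the Examples section and relies only on the element names listed above. It assumes the scpi package is installed and skips the call otherwise; the object name `prep` is arbitrary.

```r
if (requireNamespace("scpi", quietly = TRUE)) {
  data <- scpi::scpi_germany

  # Same specification as in the Examples section, with console output off.
  prep <- scpi::scdata(df = data, id.var = "country", time.var = "year",
                       outcome.var = "gdp", period.pre = 1960:1990,
                       period.post = 1991:2003, unit.tr = "West Germany",
                       unit.co = setdiff(unique(data$country), "West Germany"),
                       constant = TRUE, cointegrated.data = TRUE,
                       verbose = FALSE)

  print(class(prep))             # class of the returned object
  print(dim(prep$A))             # pre-treatment features of the treated unit
  print(dim(prep$B))             # pre-treatment features of the control units
  print(prep$specs$J)            # number of control units
  print(prep$specs$T1.outcome)   # number of post-treatment periods
}
```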

Author(s)

Matias Cattaneo, Princeton University. cattaneo@princeton.edu.

Yingjie Feng, Tsinghua University. fengyj@sem.tsinghua.edu.cn.

Filippo Palomba, Princeton University (maintainer). fpalomba@princeton.edu.

Rocio Titiunik, Princeton University. titiunik@princeton.edu.

References

Abadie, A. (2021). Using synthetic controls: Feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 59(2), 391-425.
Cattaneo, M. D., Feng, Y., and Titiunik, R. (2021). Prediction intervals for synthetic control methods. Journal of the American Statistical Association, 116(536), 1865-1880.
Cattaneo, M. D., Feng, Y., Palomba, F., and Titiunik, R. (2022). scpi: Uncertainty Quantification for Synthetic Control Methods, arXiv:2202.05984.
Cattaneo, M. D., Feng, Y., Palomba, F., and Titiunik, R. (2022). Uncertainty Quantification in Synthetic Controls with Staggered Treatment Adoption, arXiv:2210.05026.

Examples


data <- scpi_germany

df <- scdata(df = data, id.var = "country", time.var = "year",
             outcome.var = "gdp", period.pre = (1960:1990),
             period.post = (1991:2003), unit.tr = "West Germany",
             unit.co = setdiff(unique(data$country), "West Germany"),
             constant = TRUE, cointegrated.data = TRUE)

[Package scpi version 2.2.5 Index]
