scpi {scpi} | R Documentation |
Prediction Intervals for Synthetic Control Methods
Description
The command implements estimation and inference procedures for Synthetic Control (SC) methods using least squares, lasso, ridge, or simplex-type constraints. Uncertainty is quantified using prediction
intervals according to Cattaneo, Feng, and Titiunik (2021). scpi
returns the estimated
post-treatment series for the synthetic unit through the command scest
and quantifies in-sample and out-of-sample uncertainty to provide confidence intervals
for each point estimate.
Companion Stata and Python packages are described in Cattaneo, Feng, Palomba, and Titiunik (2022).
Companion commands are: scdata and scdataMulti for data preparation in the single and multiple treated unit(s) cases, respectively, scest for point estimation, scplot and scplotMulti for plots in the single and multiple treated unit(s) cases, respectively.
Related Stata, R, and Python packages useful for inference in SC designs are described in the following website:
https://nppackages.github.io/scpi/
For an introduction to synthetic control methods, see Abadie (2021) and references therein.
Usage
scpi(
data,
w.constr = NULL,
V = "separate",
V.mat = NULL,
solver = "ECOS",
P = NULL,
u.missp = TRUE,
u.sigma = "HC1",
u.order = 1,
u.lags = 0,
u.design = NULL,
u.alpha = 0.05,
e.method = "all",
e.order = 1,
e.lags = 0,
e.design = NULL,
e.alpha = 0.05,
sims = 200,
rho = NULL,
rho.max = 0.2,
lgapp = "generalized",
cores = 1,
plot = FALSE,
plot.name = NULL,
w.bounds = NULL,
e.bounds = NULL,
save.data = NULL,
verbose = TRUE
)
Arguments
data |
a class 'scdata' object, obtained by calling |
w.constr |
a list specifying the constraint set the estimated weights of the donors must belong to.
|
V |
specifies the type of weighting matrix to be used when minimizing the sum of squared residuals
The default is the identity matrix, so equal weight is given to all observations. In the case of multiple treated observations
(you used |
V.mat |
A conformable weighting matrix
See the Details section for more information on how to prepare this matrix. |
solver |
a string containing the name of the solver used by |
P |
a |
u.missp |
a logical indicating if misspecification should be taken into account when dealing with |
u.sigma |
a string specifying the type of variance-covariance estimator to be used when estimating
the conditional variance of |
u.order |
a scalar that sets the order of the polynomial in |
u.lags |
a scalar that sets the number of lags of |
u.design |
a matrix with the same number of rows of |
u.alpha |
a scalar specifying the confidence level for in-sample uncertainty, i.e. 1 - |
e.method |
a string selecting the method to be used in quantifying out-of-sample uncertainty among:
"gaussian" which uses conditional subgaussian bounds; "ls" which specifies a location-scale model for |
e.order |
a scalar that sets the order of the polynomial in |
e.lags |
a scalar that sets the number of lags of |
e.design |
a matrix with the same number of rows of |
e.alpha |
a scalar specifying the confidence level for out-of-sample uncertainty, i.e. 1 - |
sims |
a scalar providing the number of simulations to be used in quantifying in-sample uncertainty. |
rho |
a string specifying the regularizing parameter that imposes sparsity on the estimated vector of weights. If
|
rho.max |
a scalar indicating the maximum value attainable by the tuning parameter |
lgapp |
selects the way local geometry is approximated in simulation. The options are "generalized" and "linear". The first one accommodates for possibly non-linear constraints, whilst the second one is valid with linear constraints only. |
cores |
number of cores to be used by the command. The default is one. |
plot |
a logical specifying whether |
plot.name |
a string containing the name of the plot (the format is by default .png). For more options see |
w.bounds |
a |
e.bounds |
a |
save.data |
a character specifying the name and the path of the saved dataframe containing the processed data used to produce the plot. |
verbose |
if |
Details
Information is provided for the simple case in which N_1=1
if not specified otherwise.
Estimation of Weights.
w.constr
specifies the constraint set on the weights. First, the elementp
allows the user to choose between imposing a constraint on either the L1 (p = "L1"
) or the L2 (p = "L2"
) norm of the weights and imposing no constraint on the norm (p = "no norm"
). Second,Q
specifies the value of the constraint on the norm of the weights. Third,lb
sets the lower bound of each component of the vector of weights. Fourth,dir
sets the direction of the constraint on the norm in casep = "L1"
orp = "L2"
. Ifdir = "=="
, then||\mathbf{w}||_p = Q,\:\:\: w_j \geq lb,\:\: j =1,\ldots,J
If instead
dir = "<="
, then||\mathbf{w}||_p \leq Q,\:\:\: w_j \geq lb,\:\: j =1,\ldots,J
If instead
dir = "NULL"
no constraint on the norm of the weights is imposed.An alternative to specifying an ad-hoc constraint set on the weights would be choosing among some popular types of constraints. This can be done by including the element '
name
' in the listw.constr
. The following are available options:-
If
name == "simplex"
(the default), then||\mathbf{w}||_1 = 1,\:\:\: w_j \geq 0,\:\: j =1,\ldots,J.
-
If
name == "lasso"
, then||\mathbf{w}||_1 \leq Q,
where
Q
is by default equal to 1 but it can be provided as an element of the list (eg.w.constr = list(name = "lasso", Q = 2)
). If
name == "ridge"
, then||\mathbf{w}||_2 \leq Q,
where
Q
is a tuning parameter that is by default computed as(J+KM) \widehat{\sigma}_u^{2}/||\widehat{\mathbf{w}}_{OLS}||_{2}^{2}
where
J
is the number of donors andKM
is the total number of covariates used for adjustment. The user can provideQ
as an element of the list (eg.w.constr = list(name = "ridge", Q = 1)
).If
name == "ols"
, then the problem is unconstrained and the vector of weights is estimated via ordinary least squares.If
name == "L1-L2"
, then||\mathbf{w}||_1 = 1,\:\:\: ||\mathbf{w}||_2 \leq Q,
where
Q
is a tuning parameter computed as in the "ridge" case.
-
Weighting Matrix.
if
V <- "separate"
, then\mathbf{V} = \mathbf{I}
and the minimized objective function is\sum_{i=1}^{N_1} \sum_{l=1}^{M} \sum_{t=1}^{T_{0}}\left(a_{t, l}^{i}-\mathbf{b}_{t, l}^{{i \prime }} \mathbf{w}^{i}-\mathbf{c}_{t, l}^{{i \prime}} \mathbf{r}_{l}^{i}\right)^{2},
which optimizes the separate fit for each treated unit.
if
V <- "pooled"
, then\mathbf{V} = \mathbf{1}\mathbf{1}'\otimes \mathbf{I}
and the minimized objective function is\sum_{l=1}^{M} \sum_{t=1}^{T_{0}}\left(\frac{1}{N_1^2} \sum_{i=1}^{N_1}\left(a_{t, l}^{i}-\mathbf{b}_{t, l}^{i \prime} \mathbf{w}^{i}-\mathbf{c}_{t, l}^{i\prime} \mathbf{r}_{l}^{i}\right)\right)^{2},
which optimizes the pooled fit for the average of the treated units.
if the user wants to provide their own weighting matrix, then it must use the option
V.mat
to input av\times v
positive-definite matrix, wherev
is the number of rows of\mathbf{B}
(or\mathbf{C}
) after potential missing values have been removed. In case the user wants to provide their ownV
, we suggest to check the appropriate dimensionv
by inspecting the output of eitherscdata
orscdataMulti
and check the dimensions of\mathbf{B}
(and\mathbf{C}
). Note that the weighting matrix could cause problems to the optimizer if not properly scaled. For example, if\mathbf{V}
is diagonal we suggest to divide each of its entries by\|\mathrm{diag}(\mathbf{V})\|_1
.
Regularization.
rho
is estimated through the formula\varrho = \mathcal{C}\frac{\log (T_0)^c}{T_0^{1/2}}
where
\mathcal{C} = \widehat{\sigma}_u / \min_j \widehat{\sigma}_{b_j}
ifrho = 'type-1'
,\mathcal{C} = \max_{j}\widehat{\sigma}_{b_j}\widehat{\sigma}_{u} / \min_j \widehat{\sigma}_{b_j}^2
ifrho = 'type-2'
, and\mathcal{C} = \max_{j}\widehat{\sigma}_{b_ju} / \min_j \widehat{\sigma}_{b_j}^2
ifrho = 'type-3'
,rho
defines a new sparse weight vector as\widehat{w}^\star_j = \mathbf{1}(\widehat{w}_j\geq \varrho)
In-sample uncertainty. To quantify in-sample uncertainty it is necessary to model the pseudo-residuals
\mathbf{u}
. First of all, estimation of the first moment of\mathbf{u}
can be controlled through the optionu.missp
. Whenu.missp = FALSE
, then\mathbf{E}[u\: |\: \mathbf{D}_u]=0
. If insteadu.missp = TRUE
, then\mathbf{E}[\mathbf{u}\: |\: \mathbf{D}_u]
is estimated using a linear regression of\widehat{\mathbf{u}}
on\mathbf{D}_u
. The default set of variables in\mathbf{D}_u
is composed of\mathbf{B}
,\mathbf{C}
and, if required, it is augmented with lags (u.lags
) and polynomials (u.order
) of\mathbf{B}
. The optionu.design
allows the user to provide an ad-hoc set of variables to form\mathbf{D}_u
. Regarding the second moment of\mathbf{u}
, different estimators can be chosen: HC0, HC1, HC2, HC3, and HC4 using the optionu.sigma
.Out-of-sample uncertainty. To quantify out-of-sample uncertainty it is necessary to model the out-of-sample residuals
\mathbf{e}
and estimate relevant moments. By default, the design matrix used during estimation\mathbf{D}_e
is composed of the blocks in\mathbf{B}
and\mathbf{C}
corresponding to the outcome variable. Moreover, if required by the user,\mathbf{D}_e
is augmented with lags (e.lags
) and polynomials (e.order
) of\mathbf{B}
. The optione.design
allows the user to provide an ad-hoc set of variables to form\mathbf{D}_e
. Finally, the optione.method
allows the user to select one of three estimation methods: "gaussian" relies on conditional sub-Gaussian bounds; "ls" estimates conditional bounds using a location-scale model; "qreg" uses conditional quantile regression of the residuals\mathbf{e}
on\mathbf{D}_e
.Residual Estimation Over-fitting. To estimate conditional moments of
\mathbf{u}
ande_t
we rely on two design matrices,\mathbf{D}_u
and\mathbf{D}_e
(see above). Letd_u
andd_e
be the number of columns in\mathbf{D}_u
and\mathbf{D}_e
, respectively. Assuming no missing values and balanced features, the number of observation used to estimate moments of\mathbf{u}
isN_1\cdot T_0\cdot M
, whilst for moments ofe_t
isT_0
. Our rule of thumb to avoid over-fitting is to check ifN_1\cdot T_0\cdot M \geq d_u + 10
orT_0 \geq d_e + 10
. If the former condition is not satisfied we automatically setu.order = u.lags = 0
, if instead the latter is not met we automatically sete.order = e.lags = 0
.
Value
The function returns an object of class 'scpi' containing three lists. The first list is labeled 'data' and contains used
data as returned by scdata
and some other values.
A |
a matrix containing pre-treatment features of the treated unit(s). |
B |
a matrix containing pre-treatment features of the control units. |
C |
a matrix containing covariates for adjustment. |
P |
a matrix whose rows are the vectors used to predict the out-of-sample series for the synthetic unit(s). |
Y.pre |
a matrix containing the pre-treatment outcome of the treated unit(s). |
Y.post |
a matrix containing the post-treatment outcome of the treated unit(s). |
Y.pre.agg |
a matrix containing the aggregate pre-treatment outcome of the treated unit(s). This differs from
Y.pre only in the case 'effect' in |
Y.post.agg |
a matrix containing the aggregate post-treatment outcome of the treated unit(s). This differs from
Y.post only in the case 'effect' in |
Y.donors |
a matrix containing the pre-treatment outcome of the control units. |
specs |
a list containing some specifics of the data:
|
The second list is labeled 'est.results' containing all the results from scest
.
w |
a matrix containing the estimated weights of the donors. |
r |
a matrix containing the values of the covariates used for adjustment. |
b |
a matrix containing |
Y.pre.fit |
a matrix containing the estimated pre-treatment outcome of the SC unit(s). |
Y.post.fit |
a matrix containing the estimated post-treatment outcome of the SC unit(s). |
A.hat |
a matrix containing the predicted values of the features of the treated unit(s). |
res |
a matrix containing the residuals |
V |
a matrix containing the weighting matrix used in estimation. |
w.constr |
a list containing the specifics of the constraint set used on the weights. |
The third list is labeled 'inference.results' and contains all the inference-related results.
CI.in.sample |
a matrix containing the prediction intervals taking only in-sample uncertainty in to account. |
CI.all.gaussian |
a matrix containing the prediction intervals estimating out-of-sample uncertainty with sub-Gaussian bounds. |
CI.all.ls |
a matrix containing the prediction intervals estimating out-of-sample uncertainty with a location-scale model. |
CI.all.qreg |
a matrix containing the prediction intervals estimating out-of-sample uncertainty with quantile regressions. |
bounds |
a list containing the estimated bounds (in-sample and out-of-sample uncertainty). |
Sigma |
a matrix containing the estimated (conditional) variance-covariance |
u.mean |
a matrix containing the estimated (conditional) mean of the pseudo-residuals |
u.var |
a matrix containing the estimated (conditional) variance-covariance of the pseudo-residuals |
e.mean |
a matrix containing the estimated (conditional) mean of the out-of-sample error |
e.var |
a matrix containing the estimated (conditional) variance of the out-of-sample error |
u.missp |
a logical indicating whether the model has been treated as misspecified or not. |
u.lags |
an integer containing the number of lags in B used in predicting moments of the pseudo-residuals |
u.order |
an integer containing the order of the polynomial in B used in predicting moments of the pseudo-residuals |
u.sigma |
a string indicating the estimator used for |
u.user |
a logical indicating whether the design matrix to predict moments of |
u.T |
a scalar indicating the number of observations used to predict moments of |
u.params |
a scalar indicating the number of parameters used to predict moments of |
u.D |
the design matrix used to predict moments of |
u.alpha |
a scalar determining the confidence level used for in-sample uncertainty, i.e. 1- |
e.method |
a string indicating the specification used to predict moments of the out-of-sample error |
e.lags |
an integer containing the number of lags in B used in predicting moments of the out-of-sample error |
e.order |
an integer containing the order of the polynomial in B used in predicting moments of the out-of-sample error |
e.user |
a logical indicating whether the design matrix to predict moments of |
e.T |
a scalar indicating the number of observations used to predict moments of |
e.params |
a scalar indicating the number of parameters used to predict moments of |
e.alpha |
a scalar determining the confidence level used for out-of-sample uncertainty, i.e. 1- |
e.D |
the design matrix used to predict moments of |
rho |
an integer specifying the estimated regularizing parameter that imposes sparsity on the estimated vector of weights. |
Q.star |
a list containing the regularized constraint on the norm. |
epskappa |
a vector containing the estimates for |
sims |
an integer indicating the number of simulations used in quantifying in-sample uncertainty. |
failed.sims |
a matrix containing the number of failed simulations per post-treatment period to estimate lower and upper bounds. |
Author(s)
Matias Cattaneo, Princeton University. cattaneo@princeton.edu.
Yingjie Feng, Tsinghua University. fengyj@sem.tsinghua.edu.cn.
Filippo Palomba, Princeton University (maintainer). fpalomba@princeton.edu.
Rocio Titiunik, Princeton University. titiunik@princeton.edu.
References
Abadie, A. (2021). Using synthetic controls: Feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 59(2), 391-425.
Cattaneo, M. D., Feng, Y., and Titiunik, R. (2021). Prediction intervals for synthetic control methods. Journal of the American Statistical Association, 116(536), 1865-1880.
Cattaneo, M. D., Feng, Y., Palomba F., and Titiunik, R. (2022). scpi: Uncertainty Quantification for Synthetic Control Methods, arXiv:2202.05984.
Cattaneo, M. D., Feng, Y., Palomba F., and Titiunik, R. (2022). Uncertainty Quantification in Synthetic Controls with Staggered Treatment Adoption, arXiv:2210.05026.
See Also
scdata
, scdataMulti
, scest
, scplot
, scplotMulti
Examples
data <- scpi_germany
df <- scdata(df = data, id.var = "country", time.var = "year",
outcome.var = "gdp", period.pre = (1960:1990),
period.post = (1991:2003), unit.tr = "West Germany",
unit.co = setdiff(unique(data$country), "West Germany"),
constant = TRUE, cointegrated.data = TRUE)
result <- scpi(df, w.constr = list(name = "simplex", Q = 1), cores = 1, sims = 10)
result <- scpi(df, w.constr = list(lb = 0, dir = "==", p = "L1", Q = 1),
cores = 1, sims = 10)