estim.sur {ldt} | R Documentation |
Estimate a SUR Model
Description
Use this function to estimate a Seemingly Unrelated Regression model.
Usage
estim.sur(
data,
searchSigMaxIter = 0,
searchSigMaxProb = 0.1,
restriction = NULL,
pcaOptionsY = NULL,
pcaOptionsX = NULL,
simFixSize = 0,
simTrainFixSize = 0,
simTrainRatio = 0.75,
simSeed = 0,
simMaxConditionNumber = Inf
)
Arguments
data |
A list that determines data and other required information for the search process.
Use |
searchSigMaxIter |
An integer for the maximum number of iterations in searching for significant coefficients. Use 0 to disable the search. |
searchSigMaxProb |
A number for the maximum value of type I error to be used in searching for significant coefficients. If p-value is less than this, it is interpreted as significant. |
restriction |
A |
pcaOptionsY |
A list of options to use principal components of the endogenous data, instead of the actual values. Set |
pcaOptionsX |
A list of options to use principal components of the exogenous data, instead of the actual values. Set |
simFixSize |
An integer that determines the number of out-of-sample simulations. Use zero to disable the simulation. |
simTrainFixSize |
An integer representing the number of data points in the training sample in the out-of-sample simulation. If zero, |
simTrainRatio |
A number representing the size of the training sample relative to the available size, in the out-of-sample simulation. It is effective if |
simSeed |
A seed for the random number generator. Use zero for a random value. |
simMaxConditionNumber |
A number for the maximum value for the condition number in the simulation. |
Details
As described in section 10.2 in Greene (2020), this type of statistical model consists of multiple regression equations, where each equation may have a different set of exogenous variables and the disturbances between the equations are assumed to be correlated. The general form with m
equations can be written as y_i=z_i'\gamma_i+v_i
and E(v_i v_j)=\sigma_{ij}^2
for i=1,\ldots m
. Assuming that a sample of N
independent observations is available, we can stack the observations and use the following system for estimation:
Y = X B + V, \quad \mathrm{vec}B = R\gamma,
where the columns of Y:N \times m
contain the endogenous variables for each equation and the columns of X: N\times k
contain the explanatory variables, with k
being the number of unique explanatory variables in all equations. Note that X
combines the z_i
variables, and the restrictions imposed by R:mk\times q
and \gamma:q\times 1
determine a set of zero constraints on B: k \times m
, resulting in a system of equations with different sets of exogenous variables.
Imposing restrictions on the model using the R
matrix is not user-friendly, but it is suitable for use in this package, as users are not expected to specify such restrictions, but only to provide a list of potential regressors. Note that in this package, most procedures, including significance search, are supposed to be automated.
The unrestricted estimators (i.e., \hat{B}=(X'X)^{-1}X'Y
, and \hat{\Sigma}=(\hat{V}'\hat{V})/N
where \hat{V}=Y-X\hat{B}
) are used to initialize the feasible GLS estimators:
\tilde{B} = RW^{-1}R'[\hat{V}-1 \otimes x']\mathrm{vec}Y, \quad \tilde{\Sigma}=(\tilde{V}'\tilde{V})/N,
where W = R'[\hat{V}^{-1} \otimes X'X]R
and \tilde{V}=Y-X\tilde{B}
. The properties of these estimators are discussed in proposition 5.3 in Lütkepohl (2005). See also section 10.2 in Greene (2020). The maximum likelihood value is calculated by -\frac{N}{2}(m(\ln 2\pi+1)+\ln|\tilde{\Sigma}|)
. The condition number is calculated by multiplying 1-norm of W
and its inverse (e.g., see page 94 in Trefethen and Bau (1997)). Furthermore, given an out-of-sample observation such as x:k\times 1
, the prediction is y^f = \tilde{B}'x
, and its variance is estimated by the following formula:
\mathrm{var}y^f = \tilde{V} + (x' \otimes I_m)R W^{-1}R'(x \otimes I_m).
Note that the focus in ldt
is model uncertainty and for more sophisticated implementations of the FGLS estimator, you may consider using other packages such as systemfit
.
Finally, note that the main purpose of exporting this method is to show the inner calculations of the search process in search.sur function.
References
Greene WH (2020).
Econometric analysis, 8th edition.
Pearson Education Limited, New York.
ISBN 9781292231136.
Lütkepohl H (2005).
New introduction to multiple time series analysis.
Springer, Berlin.
ISBN 3540401725, doi:10.1007/978-3-540-27752-1.
Trefethen LN, Bau D (1997).
Numerical linear algebra.
Society for Industrial and Applied Mathematics.
ISBN 9780898714876.
See Also
Examples
# Example 1 (simulation, small model):
set.seed(123)
sample <- sim.sur(sigma = 2L, coef = 3L, nObs = 100)
print(sample$coef)
print(sample$sigma)
data <- data.frame(sample$y, sample$x)
# Use systemfit to estimate:
exp_names <- paste0(colnames(sample$x), collapse = " + ")
fmla <- lapply(1:ncol(sample$y), function(i) as.formula(paste0("Y", i, " ~ -1 +", exp_names)))
fit <- systemfit::systemfit(fmla, data = data, method = "SUR")
print(fit)
# Use 'ldt::estim.sur' function
fit <- estim.sur(data = get.data(cbind(sample$y, sample$x),
endogenous = ncol(sample$y),
addIntercept = FALSE))
# or, by using formula list:
fit <- estim.sur(data = get.data(data = data,
equations = fmla,
addIntercept = FALSE))
print(fit)
print(fit$estimations$sigma)
plot_data <- plot(fit, equation = 1)
# Example 2 (simulation, large model with significancy search):
num_obs <- 100
sample <- sim.sur(sigma = 2L, coef = 3L, nObs = num_obs)
print(sample$coef)
# create irrelevant data:
num_x_ir <- 20
x_ir <- matrix(rnorm(num_obs * num_x_ir), ncol = num_x_ir)
data_x <- data.frame(sample$x, x_ir)
colnames(data_x) <- c(colnames(sample$x), paste0("z", 1:num_x_ir))
fit <- estim.sur(data = get.data(cbind(sample$y, data_x),
endogenous = ncol(sample$y),
addIntercept = FALSE),
searchSigMaxIter = 100,
searchSigMaxProb = 0.05)
print(fit$estimations$coefs)
# coefficient matrix, with lots of zero restrictions
# Example 3 (simulation, large model with PCA):
# by using data of the previous example
fit <- estim.sur(data = get.data(cbind(sample$y, data_x),
endogenous = ncol(sample$y),
addIntercept = FALSE),
pcaOptionsX = get.options.pca(2,4))
print(fit$estimations$coefs)
# coefficients are: intercept and the first exogenous variable and 4 PCs