R: Estimate a SUR Model

estim.sur {ldt}

R Documentation

Estimate a SUR Model

Description

Use this function to estimate a Seemingly Unrelated Regression model.

Usage

estim.sur(
  data,
  searchSigMaxIter = 0,
  searchSigMaxProb = 0.1,
  restriction = NULL,
  pcaOptionsY = NULL,
  pcaOptionsX = NULL,
  simFixSize = 0,
  simTrainFixSize = 0,
  simTrainRatio = 0.75,
  simSeed = 0,
  simMaxConditionNumber = Inf
)

Arguments

`data`	A list that determines data and other required information for the search process. Use `get.data()` function to generate it from a `matrix` or a `data.frame`.
`searchSigMaxIter`	An integer for the maximum number of iterations in searching for significant coefficients. Use 0 to disable the search.
`searchSigMaxProb`	A number for the maximum value of type I error to be used in searching for significant coefficients. If p-value is less than this, it is interpreted as significant.
`restriction`	A `km x q` matrix of restrictions where `m` is the number of endogenous data, `k` is the number of exogenous data, and `q` is the number of unrestricted coefficients.
`pcaOptionsY`	A list of options to use principal components of the endogenous data, instead of the actual values. Set `NULL` to disable. Use `get.options.pca()` for initialization.
`pcaOptionsX`	A list of options to use principal components of the exogenous data, instead of the actual values. Set `NULL` to disable. Use `get.options.pca()` for initialization.
`simFixSize`	An integer that determines the number of out-of-sample simulations. Use zero to disable the simulation.
`simTrainFixSize`	An integer representing the number of data points in the training sample in the out-of-sample simulation. If zero, `trainRatio` will be used.
`simTrainRatio`	A number representing the size of the training sample relative to the available size, in the out-of-sample simulation. It is effective if `trainFixSize` is zero.
`simSeed`	A seed for the random number generator. Use zero for a random value.
`simMaxConditionNumber`	A number for the maximum value for the condition number in the simulation.

Details

As described in section 10.2 in Greene (2020), this type of statistical model consists of multiple regression equations, where each equation may have a different set of exogenous variables and the disturbances between the equations are assumed to be correlated. The general form with m equations can be written as y_i=z_i'\gamma_i+v_i and E(v_i v_j)=\sigma_{ij}^2 for i=1,\ldots m. Assuming that a sample of N independent observations is available, we can stack the observations and use the following system for estimation:

Y = X B + V, \quad \mathrm{vec}B = R\gamma,

where the columns of Y:N \times m contain the endogenous variables for each equation and the columns of X: N\times k contain the explanatory variables, with k being the number of unique explanatory variables in all equations. Note that X combines the z_i variables, and the restrictions imposed by R:mk\times q and \gamma:q\times 1 determine a set of zero constraints on B: k \times m, resulting in a system of equations with different sets of exogenous variables.

Imposing restrictions on the model using the R matrix is not user-friendly, but it is suitable for use in this package, as users are not expected to specify such restrictions, but only to provide a list of potential regressors. Note that in this package, most procedures, including significance search, are supposed to be automated.

The unrestricted estimators (i.e., \hat{B}=(X'X)^{-1}X'Y, and \hat{\Sigma}=(\hat{V}'\hat{V})/N where \hat{V}=Y-X\hat{B}) are used to initialize the feasible GLS estimators:

\tilde{B} = RW^{-1}R'[\hat{V}-1 \otimes x']\mathrm{vec}Y, \quad \tilde{\Sigma}=(\tilde{V}'\tilde{V})/N,

where W = R'[\hat{V}^{-1} \otimes X'X]R and \tilde{V}=Y-X\tilde{B}. The properties of these estimators are discussed in proposition 5.3 in Lütkepohl (2005). See also section 10.2 in Greene (2020). The maximum likelihood value is calculated by -\frac{N}{2}(m(\ln 2\pi+1)+\ln|\tilde{\Sigma}|). The condition number is calculated by multiplying 1-norm of W and its inverse (e.g., see page 94 in Trefethen and Bau (1997)). Furthermore, given an out-of-sample observation such as x:k\times 1, the prediction is y^f = \tilde{B}'x, and its variance is estimated by the following formula:

\mathrm{var}y^f = \tilde{V} + (x' \otimes I_m)R W^{-1}R'(x \otimes I_m).

Note that the focus in ldt is model uncertainty and for more sophisticated implementations of the FGLS estimator, you may consider using other packages such as systemfit.

Finally, note that the main purpose of exporting this method is to show the inner calculations of the search process in search.sur function.

References

Greene WH (2020). Econometric analysis, 8th edition. Pearson Education Limited, New York. ISBN 9781292231136.

Lütkepohl H (2005). New introduction to multiple time series analysis. Springer, Berlin. ISBN 3540401725, doi:10.1007/978-3-540-27752-1.

Trefethen LN, Bau D (1997). Numerical linear algebra. Society for Industrial and Applied Mathematics. ISBN 9780898714876.

Examples

# Example 1 (simulation, small model):
set.seed(123)
sample <- sim.sur(sigma = 2L, coef = 3L, nObs = 100)
print(sample$coef)
print(sample$sigma)

data <- data.frame(sample$y, sample$x)

#    Use systemfit to estimate:
exp_names <- paste0(colnames(sample$x), collapse = " + ")
fmla <- lapply(1:ncol(sample$y), function(i) as.formula(paste0("Y", i, " ~ -1 +", exp_names)))
fit <- systemfit::systemfit(fmla, data = data, method = "SUR")
print(fit)

#    Use 'ldt::estim.sur' function
fit <- estim.sur(data = get.data(cbind(sample$y, sample$x),
                                  endogenous = ncol(sample$y),
                                  addIntercept = FALSE))

# or, by using formula list:
fit <- estim.sur(data = get.data(data = data,
                                 equations = fmla,
                                 addIntercept = FALSE))

print(fit)
print(fit$estimations$sigma)
plot_data <- plot(fit, equation = 1)


# Example 2 (simulation, large model with significancy search):
num_obs <- 100
sample <- sim.sur(sigma = 2L, coef = 3L, nObs = num_obs)
print(sample$coef)

#   create irrelevant data:
num_x_ir <- 20
x_ir <- matrix(rnorm(num_obs * num_x_ir), ncol = num_x_ir)
data_x <- data.frame(sample$x, x_ir)
colnames(data_x) <- c(colnames(sample$x), paste0("z", 1:num_x_ir))

fit <- estim.sur(data = get.data(cbind(sample$y, data_x),
                                 endogenous = ncol(sample$y),
                                 addIntercept = FALSE),
                 searchSigMaxIter = 100,
                 searchSigMaxProb = 0.05)

print(fit$estimations$coefs)
# coefficient matrix, with lots of zero restrictions

# Example 3 (simulation, large model with PCA):
#   by using data of the previous example
fit <- estim.sur(data = get.data(cbind(sample$y, data_x),
                                 endogenous = ncol(sample$y),
                                 addIntercept = FALSE),
                 pcaOptionsX = get.options.pca(2,4))
print(fit$estimations$coefs)
#  coefficients are: intercept and the first exogenous variable and 4 PCs

[Package ldt version 0.5.3 Index]