R: Perform semi-nonparametric selection model estimation

hpaSelection {hpa}

R Documentation

Perform semi-nonparametric selection model estimation

Description

This function performs semi-nonparametric (SNP) maximum likelihood estimation of sample selection model using Hermite polynomial based approximating function proposed by Gallant and Nychka in 1987. Please, see dhpa 'Details' section to get more information concerning this approximating function.

Usage

hpaSelection(
  selection,
  outcome,
  data,
  selection_K = 1L,
  outcome_K = 1L,
  pol_elements = 3L,
  is_Newey = FALSE,
  x0 = numeric(0),
  is_Newey_loocv = FALSE,
  cov_type = "sandwich",
  boot_iter = 100L,
  is_parallel = FALSE,
  opt_type = "optim",
  opt_control = NULL,
  is_validation = TRUE
)

Arguments

`selection`	an object of class "formula" (or one that can be coerced to that class): a symbolic description of the selection equation form. All variables in `selection` should be numeric vectors of the same length.
`outcome`	an object of class "formula" (or one that can be coerced to that class): a symbolic description of the outcome equation form. All variables in `outcome` should be numeric vectors of the same length.
`data`	data frame containing the variables in the model.
`selection_K`	non-negative integer representing polynomial degree related to selection equation.
`outcome_K`	non-negative integer representing polynomial degree related to outcome equation.
`pol_elements`	number of conditional expectation approximating terms for Newey's method. If `is_Newey_loocv` is `TRUE` then determines maximum number of these terms during leave-one-out cross-validation.
`is_Newey`	logical; if TRUE then returns only Newey's method estimation results (default value is FALSE).
`x0`	numeric vector of optimization routine initial values. Note that `x0 = c(pol_coefficients[-1], mean, sd, z_coef, y_coef)`.
`is_Newey_loocv`	logical; if TRUE then number of conditional expectation approximating terms for Newey's method will be selected based on leave-one-out cross-validation criteria iterating through 0 to pol_elements number of these terms.
`cov_type`	character determining the type of covariance matrix to be returned and used for summary. If `cov_type = "hessian"` then negative inverse of Hessian matrix will be applied. If `cov_type = "gop"` then inverse of Jacobian outer products will be used. If `cov_type = "sandwich"` (default) then sandwich covariance matrix estimator will be applied. If `cov_type = "bootstrap"` then bootstrap with `boot_iter` iterations will be used. If `cov_type = "hessianFD"` or `cov_type = "sandwichFD"` then (probably) more accurate but computationally demanding central difference Hessian approximation will be calculated for the inverse Hessian and sandwich estimators correspondingly. Central differences are computed via analytically provided gradient. This Hessian matrix estimation approach seems to be less accurate than BFGS approximation if polynomial order is high (usually greater then 5).
`boot_iter`	the number of bootstrap iterations for `cov_type = "bootstrap"` covariance matrix estimator type.
`is_parallel`	if `TRUE` then multiple cores will be used for some calculations. It usually provides speed advantage for large enough samples (about more than 1000 observations).
`opt_type`	string value determining the type of the optimization routine to be applied. The default is `"optim"` meaning that BFGS method from the `optim` function will be applied. If `opt_type = "GA"` then `ga` function will be additionally applied.
`opt_control`	a list containing arguments to be passed to the optimization routine depending on `opt_type` argument value. Please see details to get additional information.
`is_validation`	logical value indicating whether function input arguments should be validated. Set it to `FALSE` for slight performance boost (default value is `TRUE`).

Details

Densities Hermite polynomial approximation approach has been proposed by A. Gallant and D. W. Nychka in 1987. The main idea is to approximate unknown distribution density with scaled Hermite polynomial. For more information please refer to the literature listed below.

Let's use notations introduced in dhpa 'Details' section. Function hpaSelection maximizes the following quasi log-likelihood function:

\ln L(\gamma, \beta, \alpha, \mu, \sigma; x) = \sum\limits_{i:z_{i}=1} \ln\left(\overline{F}_{\left(\xi_{1}|\xi_{2}=y_{i}-x_{i}^{o}\beta\right)} \left(-\gamma x_{i}^{s}, \infty;\alpha, \mu, \sigma\right)\right) f_{\xi_{2}}\left(y_{i}-x_{i}^{o}\beta\right)+

+\sum\limits_{i:z_{i}=0} \ln\left(\overline{F}_{\xi} (-\infty, -x_{i}^{s}\gamma;\alpha, \mu, \sigma)\right),

where (in addition to previously defined notations):

x_{i}^{s} - is row vector of selection equation regressors derived from data according to selection formula.

x_{i}^{o} - is row vector of outcome equation regressors derived from data according to outcome formula.

\gamma - is column vector of selection equation regression coefficients (constant will not be added by default).

\beta - is column vector of outcome equation regression coefficients (constant will not be added by default).

z_{i} - binary (0 or 1) dependent variable defined in selection formula.

y_{i} - continuous dependent variable defined in outcome formula.

Note that \xi is two dimensional and selection_K corresponds to K_{1} while outcome_K determines K_{2}.

The first polynomial coefficient (zero powers) set to 1 for identification purposes i.e. \alpha_{0}=1.

Rows in data corresponding to variables mentioned in selection and outcome formulas which have at least one NA value will be ignored. The exception is continues dependent variable y which may have NA values for observation where z_{i}=0.

Note that coefficient for the first independent variable in selection will be fixed to 1 i.e. \gamma_{1}=1.

All variables mentioned in selection and outcome should be numeric vectors.

The function calculates standard errors via sandwich estimator and significance levels are reported taking into account quasi maximum likelihood estimator (QMLE) asymptotic normality. If one wants to switch from QMLE to semi-nonparametric estimator (SNPE) during hypothesis testing then covariance matrix should be estimated again using bootstrap.

Initial values for optimization routine are obtained by Newey's method (see the reference below). In order to obtain initial values via least squares please, set pol_elements = 0. Initial values for the outcome equation are obtained via hpaBinary function setting K to selection_K.

Note that selection equation dependent variables should have exactly two levels (0 and 1) where "0" states for the selection results which leads to unobservable values of dependent variable in outcome equation.

This function maximizes (quasi) log-likelihood function via optim function setting its method argument to "BFGS". If opt_type = "GA" then genetic algorithm from ga function will be additionally (after optim putting its solution (par) into suggestions matrix) applied in order to perform global optimization. Note that global optimization takes much more time (usually minutes but sometimes hours or even days). The number of iterations and population size of the genetic algorithm will grow linearly along with the number of estimated parameters. If it seems that global maximum has not been found then it is possible to continue the search restarting the function setting its input argument x0 to x1 output value. Note that if cov_type = "bootstrap" then ga function will not be used for bootstrap iterations since it may be extremely time consuming.

If opt_type = "GA" then opt_control should be the list containing the values to be passed to ga function. It is possible to pass arguments lower, upper, popSize, pcrossover, pmutation, elitism, maxiter, suggestions, optim, optimArgs, seed and monitor. Note that it is possible to set population, selection, crossover and mutation arguments changing ga default parameters via gaControl function. These arguments information reported in ga. In order to provide manual values for lower and upper bounds please follow parameters ordering mentioned above for the x0 argument. If these bounds are not provided manually then they (except those related to the polynomial coefficients) will depend on the estimates obtained by local optimization via optim function (this estimates will be in the middle between lower and upper). Specifically for each sd parameter lower (upper) bound is 5 times lower (higher) than this parameter optim estimate. For each mean and regression coefficient parameter its lower and upper bounds deviate from corresponding optim estimate by two absolute values of this estimate. Finally, lower and upper bounds for each polynomial coefficient are -10 and 10 correspondingly (do not depend on their optim estimates).

The following arguments are differ from their defaults in ga:

pmutation = 0.2,
optim = TRUE,
optimArgs = list("method" = "Nelder-Mead", "poptim" = 0.2, "pressel" = 0.5),
seed = 8,
elitism = 2 + round(popSize * 0.1).

Let's denote by n_reg the number of regressors included into the selection and outcome formulas. The arguments popSize and maxiter of ga function have been set proportional to the number of estimated polynomial coefficients and independent variables:

popSize = 10 + 5 * (z_K + 1) * (y_K + 1) + 2 * n_reg
maxiter = 50 * (z_K + 1) * (y_K + 1) + 10 * n_reg

Value

This function returns an object of class "hpaSelection".

An object of class "hpaSelection" is a list containing the following components:

optim - optim function output. If opt_type = "GA" then it is the list containing optim and ga functions outputs.
x1 - numeric vector of distribution parameters estimates.
Newey - list containing information concerning Newey's method estimation results.
selection_mean - estimate of the hermite polynomial mean parameter related to selection equation random error marginal distribution.
outcome_mean - estimate of the hermite polynomial mean parameter related to outcome equation random error marginal distribution.
selection_sd - estimate of sd parameter related to selection equation random error marginal distribution.
outcome_sd - estimate of the hermite polynomial sd parameter related to outcome equation random error marginal distribution.
pol_coefficients - polynomial coefficients estimates.
pol_degrees - numeric vector which first element is selection_K and the second is outcome_K.
selection_coef - selection equation regression coefficients estimates.
outcome_coef - outcome equation regression coefficients estimates.
cov_mat - covariance matrix estimate.
results - numeric matrix representing estimation results.
log-likelihood - value of Log-Likelihood function.
re_moments - list which contains information about random errors expectations, variances and correlation.
data_List - list containing model variables and their partition according to outcome and selection equations.
n_obs - number of observations.
ind_List - list which contains information about parameters indexes in x1.
selection_formula - the same as selection input parameter.
outcome_formula - the same as outcome input parameter.

Abovementioned list Newey has class "hpaNewey" and contains the following components:

outcome_coef - regression coefficients estimates (except constant term which is part of conditional expectation approximating polynomial).
selection_coef - regression coefficients estimates related to selection equation.
constant_biased - biased estimate of constant term.
inv_mills - inverse mills ratios estimates and their powers (including constant).
inv_mills_coef - coefficients related to inv_mills.
pol_elements - the same as pol_elements input parameter. However if is_Newey_loocv is TRUE then it will equal to the number of conditional expectation approximating terms for Newey's method which minimize leave-one-out cross-validation criteria.
outcome_exp_cond - dependent variable conditional expectation estimates.
selection_exp - selection equation random error expectation estimate.
selection_var - selection equation random error variance estimate.
hpaBinaryModel - object of class "hpaBinary" which contains selection equation estimation results.

Abovementioned list re_moments contains the following components:

selection_exp - selection equation random errors expectation estimate.
selection_var - selection equation random errors variance estimate.
outcome_exp - outcome equation random errors expectation estimate.
outcome_var - outcome equation random errors variance estimate.
errors_covariance - outcome and selection equation random errors covariance estimate.
rho - outcome and selection equation random errors correlation estimate.
rho_std - outcome and selection equation random errors correlation estimator standard error estimate.

References

A. Gallant and D. W. Nychka (1987) <doi:10.2307/1913241>

W. K. Newey (2009) <https://doi.org/10.1111/j.1368-423X.2008.00263.x>

Mroz T. A. (1987) <doi:10.2307/1911029>

Examples


## Let's estimate wage equation accounting for non-random selection.
## See the reference to Mroz TA (1987) to get additional details about
## the data this examples use

# Prepare data
library("sampleSelection")
data("Mroz87")
h = data.frame("kids" = as.numeric(Mroz87$kids5 + Mroz87$kids618 > 0),
	"age" = as.numeric(Mroz87$age),
	"faminc" = as.numeric(Mroz87$faminc),
	"educ" = as.numeric(Mroz87$educ),
	"exper" = as.numeric(Mroz87$exper),
	"city" = as.numeric(Mroz87$city),
	"wage" = as.numeric(Mroz87$wage),
	"lfp" = as.numeric(Mroz87$lfp))
	
# Estimate model parameters
model <- hpaSelection(selection = lfp ~ educ + age + I(age ^ 2) + 
                                        kids + log(faminc),
                      outcome = log(wage) ~ exper + I(exper ^ 2) + 
                                            educ + city,
                                  selection_K = 2, outcome_K = 3, 
                                  data = h, 
                                  pol_elements = 3, is_Newey_loocv = TRUE)
summary(model)

# Plot outcome equation random errors density
plot(model, type = "outcome")
# Plot selection equation random errors density
plot(model, type = "selection")


## Estimate semi-nonparametric sample selection model
## parameters on simulated data given chi-squared random errors


set.seed(100)
library("mvtnorm")

# Sample size

n <- 1000

# Simulate independent variables
X_rho <- 0.5
X_sigma <- matrix(c(1, X_rho, X_rho,
                    X_rho, 1, X_rho, 
                    X_rho,X_rho,1), 
                  ncol=3)
X <- rmvnorm(n=n, mean = c(0,0,0), 
             sigma = X_sigma)

# Simulate random errors
epsilon <- matrix(0, n, 2)
epsilon_z_y <- rchisq(n, 5)
epsilon[, 1] <- (rchisq(n, 5) + epsilon_z_y) * (sqrt(3/20)) - 3.8736
epsilon[, 2] <- (rchisq(n, 5) + epsilon_z_y) * (sqrt(3/20)) - 3.8736
# Simulate selection equation
z_star <- 1 + 1 * X[,1] + 1 * X[,2] + epsilon[,1]
z <- as.numeric((z_star > 0))

# Simulate outcome equation
y_star <- 1 + 1 * X[,1] + 1 * X[,3] + epsilon[,2]
z <- as.numeric((z_star > 0))
y <- y_star
y[z==0] <- NA
h <- as.data.frame(cbind(z, y, X))
names(h) <- c("z", "y", "x1", "x2", "x3")

# Estimate parameters
model <- hpaSelection(selection = z ~ x1 + x2, 
                      outcome = y ~ x1 + x3,
                      data = h, 
                      selection_K = 1, outcome_K = 3)
summary(model)

# Get conditional predictions for outcome equation
model_pred_c <- predict(model, is_cond = TRUE)
# Conditional predictions y|z=1
model_pred_c$y_1
# Conditional predictions y|z=0
model_pred_c$y_0

# Get unconditional predictions for outcome equation
model_pred_u <- predict(model, is_cond = FALSE)
model_pred_u$y

# Get conditional predictions for selection equation
# Note that for z=0 these predictions are NA
predict(model, is_cond = TRUE, type = "selection")
# Get unconditional predictions for selection equation
predict(model, is_cond = FALSE, type = "selection")

[Package hpa version 1.3.3 Index]