hpaBinary {hpa} | R Documentation
Semi-nonparametric single index binary choice model estimation
Description
This function performs semi-nonparametric (SNP) maximum
likelihood estimation of a single index binary choice model
using the Hermite polynomial based approximating function proposed by Gallant
and Nychka in 1987. Please see the dhpa 'Details' section
for more information concerning this approximating function.
Usage
hpaBinary(
formula,
data,
K = 1L,
mean_fixed = NA_real_,
sd_fixed = NA_real_,
constant_fixed = 0,
coef_fixed = TRUE,
is_x0_probit = TRUE,
is_sequence = FALSE,
x0 = numeric(0),
cov_type = "sandwich",
boot_iter = 100L,
is_parallel = FALSE,
opt_type = "optim",
opt_control = NULL,
is_validation = TRUE
)
Arguments
formula
an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. All variables in formula should be numeric vectors.
data
data frame containing the variables in the model.
K
non-negative integer representing the polynomial degree (order).
mean_fixed
numeric value for the binary choice equation random error density mean parameter. Set it to NA if this parameter should be estimated rather than fixed.
sd_fixed
numeric value for the binary choice equation random error density sd parameter. Set it to NA if this parameter should be estimated rather than fixed.
constant_fixed
numeric value for the binary choice equation constant parameter. Set it to NA if this parameter should be estimated rather than fixed.
coef_fixed
logical value indicating whether the coefficient of the first independent variable in the binary equation should be fixed (TRUE) or estimated (FALSE).
is_x0_probit
logical; if TRUE (default) then initial values for the optimization routine are obtained from a probit model.
is_sequence
if TRUE then the function estimates models with polynomial degrees from 0 to K, each time using initial values obtained from the previous step. In this case the function returns a list of models where the i-th element corresponds to the model estimated with K = (i - 1).
x0
numeric vector of optimization routine initial values. Note that the ordering of its elements is the same as in the x1 output value.
cov_type
character determining the type of covariance matrix to be returned and used for summary. If cov_type = "sandwich" (default) then the sandwich covariance matrix estimator is applied; if cov_type = "bootstrap" then the covariance matrix is estimated via bootstrap with boot_iter iterations.
boot_iter
the number of bootstrap iterations for cov_type = "bootstrap" covariance matrix estimation.
is_parallel
if TRUE then some calculations are performed in parallel, which may reduce estimation time.
opt_type
string value determining the type of the optimization routine to be applied. The default is "optim" meaning that the BFGS method from the optim function is applied. If opt_type = "GA" then a genetic algorithm (the ga function) is additionally applied; see 'Details'.
opt_control
a list containing arguments to be passed to the optimization routine depending on the opt_type argument value; see 'Details'.
is_validation
logical value indicating whether function input arguments should be validated. Set it to FALSE to slightly reduce estimation time (do so only if you are sure that the arguments are correctly specified).
Details
The densities Hermite polynomial approximation approach was proposed by A. Gallant and D. W. Nychka in 1987. The main idea is to approximate the unknown distribution density with a scaled Hermite polynomial. For more information please refer to the literature listed below.
Let's use the notations introduced in the dhpa 'Details' section.
Function hpaBinary maximizes the following quasi log-likelihood function:
\ln L(\gamma_{0}, \gamma, \alpha, \mu, \sigma; x) =
\sum\limits_{i:z_{i}=1}
\ln\left(\overline{F}_{\xi}(-(\gamma_{0} + x_{i}\gamma), \infty; \alpha, \mu, \sigma)\right) +
\sum\limits_{i:z_{i}=0}
\ln\left(\overline{F}_{\xi}(-\infty, -(\gamma_{0} + x_{i}\gamma); \alpha, \mu, \sigma)\right),
where (in addition to previously defined notations):
x_{i} - row vector of regressors derived from data according to formula.
\gamma - column vector of regression coefficients.
\gamma_{0} - constant.
z_{i} - binary (0 or 1) dependent variable defined in formula.
Note that \xi is one dimensional and K corresponds to K = K_{1}.
The first polynomial coefficient (zero power) is set to 1 for identification purposes, i.e. \alpha_{0} = 1.
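To make the objective function concrete, the sketch below evaluates this quasi log-likelihood for a generic interval probability function F_xi, a hypothetical stand-in for \overline{F}_{\xi} above (it is not part of the hpa API):
# A minimal sketch; F_xi(lower, upper) is assumed to be vectorized and to
# return P(lower < xi < upper) under the SNP density approximation
binary_loglik <- function(gamma0, gamma, z, X, F_xi) {
  index <- gamma0 + as.vector(X %*% gamma)  # single index for each observation
  p1 <- F_xi(-index, Inf)                   # P(z = 1 | x) = P(xi > -(gamma0 + x * gamma))
  p0 <- F_xi(-Inf, -index)                  # P(z = 0 | x) = P(xi <= -(gamma0 + x * gamma))
  sum(log(p1[z == 1])) + sum(log(p0[z == 0]))
}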
If coef_fixed is TRUE then the coefficient of the first independent variable in formula will be fixed to 1, i.e. \gamma_{1} = 1.
If mean_fixed is not NA then \mu = mean_fixed is fixed.
If sd_fixed is not NA then \sigma = sd_fixed is fixed. However, if is_x0_probit = TRUE
then parameter \sigma will be scale adjusted in order to provide a better initial point
for the optimization routine. Please extract the adjusted \sigma value from the function's
output list. The same applies to mean_fixed.
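For example, the following call (a sketch reusing the simulated data frame h from the Examples section below) fixes the random error mean at 0 while estimating the remaining parameters:
# Fix the random error mean at 0; sd and polynomial coefficients are estimated
model_fixed <- hpaBinary(z ~ x1 + x2, data = h, K = 2, mean_fixed = 0)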
Rows in data that contain at least one NA value among the variables mentioned in formula
will be ignored. All variables mentioned in formula should be numeric vectors.
The function calculates standard errors via the sandwich estimator, and significance levels are reported taking into account quasi maximum likelihood estimator (QMLE) asymptotic normality. If one wants to switch from QMLE to the semi-nonparametric estimator (SNPE) during hypothesis testing then the covariance matrix should be estimated again via bootstrap.
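For instance, a bootstrap-based covariance matrix can be requested as follows (a sketch reusing the simulated data from the Examples section below; bootstrapping noticeably increases estimation time):
# Re-estimate the covariance matrix via bootstrap
model_boot <- hpaBinary(z ~ x1 + x2, data = h, K = 3,
                        cov_type = "bootstrap", boot_iter = 100)
summary(model_boot)  # significance levels now rely on the bootstrap covariance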
This function maximizes the (quasi) log-likelihood function via the optim
function, setting its method argument to "BFGS". If opt_type = "GA" then the genetic
algorithm from the ga function will additionally be applied
(after optim, putting its solution par into the suggestions matrix) in order to
perform global optimization. Note that global optimization takes
much more time (usually minutes but sometimes hours or even days).
The number of iterations and the population size of the genetic algorithm
grow linearly with the number of estimated parameters.
If it seems that the global maximum has not been found then it
is possible to continue the search by restarting the function with its
input argument x0 set to the x1 output value, as illustrated below. Note that
if cov_type = "bootstrap" then the ga function will not be used during
bootstrap iterations since it may be extremely time consuming.
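A minimal sketch of such a restart (assuming model is a previously fitted "hpaBinary" object on the simulated data from the Examples section below):
# Continue the search from the previous solution using the genetic algorithm
model_restart <- hpaBinary(z ~ x1 + x2, data = h, K = 3,
                           opt_type = "GA", x0 = model$x1)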
If opt_type = "GA"
then opt_control
should be the
list containing the values to be passed to ga
function. It is possible to pass arguments lower
, upper
,
popSize
, pcrossover
, pmutation
, elitism
,
maxiter
, suggestions
, optim
, optimArgs
,
seed
and monitor
.
Note that it is possible to set population
,
selection
, crossover
and mutation
arguments changing
ga
default parameters via gaControl
function. These arguments information reported in ga
.
In order to provide manual values for the lower and upper bounds
please follow the parameters ordering used for the x0 argument.
If these bounds are not provided manually then
they (except those related to the polynomial coefficients)
will depend on the estimates obtained
by local optimization via the optim function
(these estimates will lie in the middle between lower and upper).
Specifically, for each sd parameter the lower (upper) bound
is 5 times lower (higher) than this parameter's optim estimate.
For each mean and regression coefficient parameter the lower and
upper bounds deviate from the corresponding optim estimate
by two absolute values of that estimate.
Finally, the lower and upper bounds for each polynomial
coefficient are -10 and 10 respectively (and do not depend
on their optim estimates).
The following arguments differ from their ga defaults:
- pmutation = 0.2
- optim = TRUE
- optimArgs = list("method" = "Nelder-Mead", "poptim" = 0.2, "pressel" = 0.5)
- seed = 8
- elitism = 2 + round(popSize * 0.1)
Let's denote by n_reg the number of regressors included into the formula.
The arguments popSize and maxiter of the ga function have been set
proportional to the number of estimated polynomial coefficients and
independent variables:
- popSize = 10 + 5 * (K + 1) + 2 * n_reg
- maxiter = 50 * (1 + K) + 10 * n_reg
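For example, the genetic algorithm settings could be overridden as follows (a sketch reusing the simulated data from the Examples section below; the chosen values are illustrative rather than recommended):
# Pass custom genetic algorithm settings via opt_control
model_ga <- hpaBinary(z ~ x1 + x2, data = h, K = 3,
                      opt_type = "GA",
                      opt_control = list(popSize = 50, maxiter = 200,
                                         pmutation = 0.3, seed = 123))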
Value
This function returns an object of class "hpaBinary".
An object of class "hpaBinary" is a list containing the
following components:
- optim - optim function output. If opt_type = "GA" then it is the list containing optim and ga functions outputs.
- x1 - numeric vector of distribution parameters estimates.
- mean - mean (mu) parameter of density function estimate.
- sd - sd (sigma) parameter of density function estimate.
- pol_coefficients - polynomial coefficients estimates.
- pol_degrees - the same as K input parameter.
- coefficients - regression (single index) coefficients estimates.
- cov_mat - covariance matrix estimate.
- marginal_effects - marginal effects matrix where columns are variables and rows are observations.
- results - numeric matrix representing estimation results.
- log-likelihood - value of Log-Likelihood function.
- AIC - AIC value.
- errors_exp - random error expectation estimate.
- errors_var - random error variance estimate.
- dataframe - data frame containing variables mentioned in formula without NA values.
- model_Lists - lists containing information about fixed parameters and parameters indexes in x1.
- n_obs - number of observations.
- z_latent - latent variable (single index) estimates.
- z_prob - probabilities of positive outcome (i.e. 1) estimates.
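For instance, these components are accessed with the usual list syntax (a sketch; model_hpa_1 is the object estimated in the Examples section below):
model_hpa_1$coefficients      # single index coefficients estimates
model_hpa_1$pol_coefficients  # polynomial coefficients estimates
model_hpa_1$cov_mat           # covariance matrix estimate
model_hpa_1$z_prob            # estimated probabilities of outcome 1 (survival)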
References
A. Gallant and D. W. Nychka (1987) <doi:10.2307/1913241>
See Also
summary.hpaBinary, predict.hpaBinary, plot.hpaBinary, logLik.hpaBinary
Examples
## Estimate survival probability on Titanic
library("titanic")
# Prepare data set converting
# all variables to numeric vectors
h <- data.frame("male" = as.numeric(titanic_train$Sex == "male"))
h$class_1 <- as.numeric(titanic_train$Pclass == 1)
h$class_2 <- as.numeric(titanic_train$Pclass == 2)
h$class_3 <- as.numeric(titanic_train$Pclass == 3)
h$sibl <- titanic_train$SibSp
h$survived <- titanic_train$Survived
h$age <- titanic_train$Age
h$parch <- titanic_train$Parch
h$fare <- titanic_train$Fare
# Estimate model parameters
model_hpa_1 <- hpaBinary(survived ~ class_1 + class_2 +
                         male + age + sibl + parch + fare,
                         K = 3, data = h)
# Get summary
summary(model_hpa_1)
# Get predicted probabilities
pred_hpa_1 <- predict(model_hpa_1)
# Calculate number of correct predictions
hpa_1_correct_0 <- sum((pred_hpa_1 < 0.5) &
(model_hpa_1$dataframe$survived == 0))
hpa_1_correct_1 <- sum((pred_hpa_1 >= 0.5) &
(model_hpa_1$dataframe$survived == 1))
hpa_1_correct <- hpa_1_correct_1 + hpa_1_correct_0
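# Calculate the share of correct predictions (a simple accuracy measure);
# n_obs is the number of observations actually used in estimation
hpa_1_accuracy <- hpa_1_correct / model_hpa_1$n_obs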
# Plot random errors density approximation
plot(model_hpa_1)
## Estimate parameters on data simulated from Student distribution
library("mvtnorm")
set.seed(123)
# Simulate independent variables from normal distribution
n <- 5000
X <- rmvnorm(n=n, mean = c(0,0),
sigma = matrix(c(1,0.5,0.5,1), ncol=2))
# Simulate random errors from Student distribution
epsilon <- rt(n, 5) * (3 / sqrt(5))
# Calculate latent and observable variables values
z_star <- 1 + X[, 1] + X[, 2] + epsilon
z <- as.numeric((z_star > 0))
# Store the results into data frame
h <- as.data.frame(cbind(z,X))
names(h) <- c("z", "x1", "x2")
# Estimate model parameters
model <- hpaBinary(formula = z ~ x1 + x2, data=h, K = 3)
summary(model)
# Get predicted probabilities of 1 values
predict(model)
# Plot density function approximation
plot(model)
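# Inspect marginal effects estimates
# (columns correspond to variables, rows to observations)
head(model$marginal_effects)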