mnprobit {switchSelection} | R Documentation |
Multinomial probit model
Description
This function estimates parameters of multinomial probit model and sample selection model with continuous outcome and multinomial probit selection mechanism.
Usage
mnprobit(
formula,
formula2 = NULL,
data,
regimes = NULL,
opt_type = "optim",
opt_args = NULL,
start = NULL,
estimator = "ml",
cov_type = "sandwich",
degrees = NULL,
n_sim = 1000,
n_cores = 1,
control = NULL,
regularization = NULL
)
Arguments
formula |
an object of class "formula" corresponding to multinomial (selection) equation. |
formula2 |
an object of class "formula" corresponding to continuous (outcome) equation. |
data |
data frame containing the variables in the model. |
regimes |
numeric vector such that |
opt_type |
character representing optimization function to be used.
If |
opt_args |
a list of input arguments for the optimization function
selected via |
start |
numeric vector of initial parameters' values. It will be used
as a starting point for optimization purposes. It is also possible to
provide an object of class |
estimator |
character determining estimation method.
If |
cov_type |
character determining the type of covariance matrix to be
returned and used for summary. If |
degrees |
vector of non-negative integers such that |
n_sim |
integer representing the number of GHK draws when there are more than 3 ordered equations. Otherwise alternative (much more efficient) algorithms will be used to calculate multivariate normal probabilities. |
n_cores |
positive integer representing the number of CPU cores used for parallel computing. If possible it is highly recommend to set it equal to the number of available physical cores especially when the system of ordered equations has 2 or 3 equations. |
control |
a list of control parameters. See 'Details'. |
regularization |
a list of control parameters for regularization.
Element |
Details
For identification purposes the following parametrization of the multinomial probit model is used:
z_{ji}^{*} = w_{i}\gamma_{j} + u_{ji}, \quad z_{Ji}^{*} = 0,
i\in\{1,2,...,n\},\quad j\in\{1,2,...,J-1\},
z_{i}=\underset{t\in\{1,...,J\}}{\text{argmax} }\text{ }z_{ti}^{*}, \qquad
u_{i} = (u_{1i},u_{2i},...,u_{(J-1)i})^{T},
u_{i}\sim N\left(\begin{bmatrix}0\\ \vdots\\ 0\end{bmatrix},
\Sigma\right), i.i.d.,
\Sigma = \begin{bmatrix}
1 & \sigma_{12}& \sigma_{13} & ... & \sigma_{1(J-1)}\\
\sigma_{12} & \sigma_{2}^2 & \sigma_{23} & ... & \sigma_{2(J-1)}\\
\sigma_{13} & \sigma_{23} & \sigma_{3}^2 & ... & \sigma_{3(J-1)}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\sigma_{1(J-1)} & \sigma_{2(J-1)} & \sigma_{3(J-1)} & ... & \sigma_{J-1}^2
\end{bmatrix}.
Where:
-
J
- the number of alternatives. -
z_{ji}^{*}
- unobservable (latent) value of thej
-th alternative. Usuallyz_{ji}^{*}
is interpreted as a utility of thej
-th alternative. -
z_{i}
- selected alternative. -
w_{i}
- regressors that should be described informula
. Regressors are assumed to be the same for all alternatives. -
\gamma_{j}
- regression coefficients of thej
-th alternative's equation. -
w_{i}\gamma_{j}
- linear index of thej
-th alternative's equation. -
u_{i}
- multivariate normal random vector which elements are normal random variables. -
\Sigma
- covariance matrix ofu_{i}
.
Note that alternatives z_{i}
should be represented in data
as
integers starting from 1
(not 0
).
It is also possible to account for multinomial sample selection and
endogenous switching. Consider a simple example.
Suppose that there is a sample containing
information on wages of individuals. Let's denote wages of people who
are working in information technologies (IT) sector and of those who are not
by y_{1i}
and y_{0i}
correspondingly. The effect of various
characteristics x_{i}
on y_{0i}
and y_{1i}
may differ.
For example programming skills probably have a greater impact on y_{1i}
than on y_{0i}
. So there is different equations (regimes) for these
wages:
y_{0i} = x_{i}\beta_{0} + \varepsilon_{0i}\\
y_{1i} = x_{i}\beta_{1} + \varepsilon_{1i}\\
\varepsilon_{i}=\left(\varepsilon_{0i}, \varepsilon_{1i}\right)
\text{, i.i.d.}
where \beta_{0}
is a vector of regression coefficients representing
the effect of individual characteristics x_{i}
on wage y_{0i}
.
Similarly for \beta_{1}
.
Herewith there is non-random selection into IT
sector. Suppose that z_{i}=1
if individual is working in IT sector,
z_{i}=2
if individual is employed in non-IT sector, and
z_{i}=3
if individual is unemployed.
So observable wage is:
y_{i} =
\begin{cases}
y_{1i}\text{, if }z_{i} = 1\\
y_{0i}\text{, if }z_{i} = 2\\
\text{unobservable, if }z_{i} = 3
\end{cases}
It is assumed that joint distribution of
u_{i}
and \varepsilon_{i}
is multivariate normal.
To estimate parameters of this model it is necessarily to assign regimes
to alternatives. Note that regime[k]
corresponds to the regime of
y_{i}
when z_{i} = k
.
Therefore set regimes = c(1, 0, -1)
where
-1
is a special regime
for endogenously omitted observations. Dependent variable y_{i}
and
regressors x_{i}
should be provided via formula2
argument.
Note that in some applications several alternatives may have the same
regime.The number of regimes
will have a moderate impact on computational burden. However the
function may work extremely slow when there are more than 4
alternatives.
By default the model is estimated via maximum likelihood method. However
if estimator = "2step"
then two-step procedure proposed by
E. Kossova and B. Potanin (2022) will be used. The idea is similar to
Heckman's method i.e. to estimate the following regression equation:
y_{i} = x_{i}\beta +
\sum\limits_{t=1}^{J-1}
\rho_{t}\sigma_{t}\sigma_{\varepsilon}
\tilde{\lambda}^{(z_{i})}_{ti},
where:
\tilde{\lambda}^{(j)}_{i}=A^{(j)}\lambda^{(t)}_{i}\\
A^{(j)}_{t_{1}t_{2}} =
\begin{cases}
1\text{, if } t_{1}=j\\
-1\text{, if } t_{1}<j \text{ and } t_{1}=t_{2}\\
-1\text{, if } t_{1}>j \text{ and } t_{1}=t_{2} + 1\\
0\text{, otherwise}
\end{cases}, t_{1},t_{2}\leq J-1
\lambda^{(j)}_{i}=\lambda^{(j)}_{i}
\left(\tilde{z}_{1}^{(ji)},...,
\tilde{z}_{J-1}^{(ji)}\right)=
\nabla \ln P\left(z_{i}\right) =
\nabla \ln F_{\tilde{u}^{(ji)}}\left(\tilde{z}_{1}^{(ji)},...,
\tilde{z}_{J-1}^{(ji)}\right)=
\left(\lambda_{1i}^{(j)},...,\lambda_{(J-1)i}^{(j)} \right),\\
\tilde{u}^{(ji)} =
\left(u_{1i}-u_{ji},u_{2i}-u_{ji},...,
u_{(j-1)i}-u_{ji},u_{(j+1)i}-u_{ji},...,
u_{Ji}-u_{ji}\right),\\
\tilde{z}^{(ji)} = \left(w_{i}\left(\gamma_{j}-\gamma_{1}\right),
w_{i}\left(\gamma_{j}-\gamma_{2}\right),...,
w_{i}\left(\gamma_{j}-\gamma_{j-1}\right),
w_{i}\left(\gamma_{j}-\gamma_{j+1}\right),...,
w_{i}\left(\gamma_{j}-\gamma_{J}\right)\right),\\
u_{Ji}=0,\quad \gamma_{J}=\left(0,...,0\right).
Note that \tilde{\lambda}
are estimated on the first step using
estimates from multinomial probit model. On the second step these estimates
are used as the regressors (covariates) where \beta
and
\rho_{t}\sigma_{t}\sigma_{\varepsilon}
are estimated via least squares
method. Standard errors are estimated by approach proposed by Newey (2009).
Argument degrees
is similar to the same argument of
mvoprobit
.
Optimization always starts with optim
. If
opt_type = "gena"
or opt_type = "pso"
then
gena
or pso
is used
to proceed optimization starting
from initial point provided by optim
. Manual arguments to
optim
should be provided in a form of a list through opt_args$optim
.
Similarly opt_args$gena
and opt_args$pso
provide manual
arguments to gena
and pso
.
For example to provide Nelder-Mead optimizer to
optim
and
restrict the number of genetic algorithm iterations to 10
make
opt_args = list(optim = list(method = "Nelder-Mead"),
gena = list(maxiter = 10))
.
For more information on multivariate sample selection and endogenous switching models see E. Kossova and B. Potanin (2018), E. Kossova, L. Kupriianova, and B. Potanin (2020) and E. Kossova and B. Potanin (2022).
Function pmnorm
is used internally for calculation
of multivariate normal probabilities, densities and their derivatives.
Currently control
has no input arguments intended for the users.
This argument is used for some internal purposes of the package.
Value
This function returns an object of class 'mnprobit' which is a list containing the following elements:
-
par
- vector of parameters' estimates. -
coef
- matrix which j-the columncoef[, j]
is a vector of regression coefficients estimates of the j-th alternative equation i.e.\hat{\gamma}_{j}
. -
coef2
- matrix which j-the columncoef2[, j]
is a vector of regression coefficients estimates of the continuous equation in (j+1)-th regime. -
sigma
- estimate of the covariance matrix of random errors of alternatives equations i.e.\widehat{\Sigma}
. -
cov2
- matrix which elementcov2[i, j]
is an estimate of the covariance between random error of i-th alternative equation and random error of continuous equation in (j+1)-th regime. -
var2
- a vector such thatvar2[i]
is the estimate of the variance of the random error of continuous equation in (i+1)-the regime. -
logLik
- log-likelihood value. -
W
- numeric matrix of regressors of the system of multinomial equations. -
X
- numeric matrix of regressors of continuous equation. -
z
- numeric vector of multinomial dependent variable values. -
y
- numeric vector of continuous variable values. -
control_lnL
- some additional variables to be passed to likelihood function (not intended for users). -
formula
- the same asformula
input argument but all elements are coerced to formula type. -
formula2
- the same asformula
input argument but all elements are coerced to formula type. -
lambda
- matrix such thatlambda[i, j]
corresponds to\hat{\tilde{\lambda}}_{ji}
. -
data
- the same asdata
input argument but without missing values. -
cov
- estimate of the covariance matrix of parameters' estimator. -
cov_type
- type of the asymptotic covariance matrix estimator. -
cov_2step
- estimate of the covariance matrix of parameters' estimator associated with the second step parameters i.e. whenestimator = "2step"
. -
tbl
- special table used to create a summary (not intended for users). -
regimes
- the same asregimes
input argument or automatically generated matrix representing the structure of the system of equations. Please, see 'Details' section above for more information. -
n_regimes
- the number of regimes. -
degrees
- the same asdegrees
input argument. -
model1
- first step estimation results whenestimator = "2step"
. -
coef_lambda
- estimates of coefficients of lambdas. -
n_alt
- the number of alternatives. -
n_obs
- the number of observations. -
J
- the Jacobian of the likelihood function. -
p_value
- p-values of the tests on significance of the parameters where null hypothesis is that corresponding parameter equals zero. -
other
- list of additional variables that is not intended for the user.
It is highly recommended to get estimates via
coef.mnprobit
function.
References
W. K. Newey (2009). Two-step series estimation of sample selection models. The Econometrics Journal, vol. 12(1), pages 217-229.
E. Kossova, B. Potanin (2018). Heckman method and switching regression model multivariate generalization. Applied Econometrics, vol. 50, pages 114-143.
E. Kossova, L. Kupriianova, B. Potanin (2020). Parametric and semiparametric multivariate sample selection models estimators' accuracy: Comparative analysis on simulated data. Applied Econometrics, vol. 57, pages 119-139.
E. Kossova, B. Potanin (2022). Estimation of Gaussian multinomial endogenous switching model. Applied Econometrics, vol. 67, pages 121-143.
Examples
# -------------------------------
# CPS data example
# -------------------------------
# Set seed for reproducibility
set.seed(123)
# Upload data
data(cps)
# Prepare multinomial variable for education
cps$educ <- NA
cps$educ[cps$basic == 1] <- 1
cps$educ[cps$bachelor == 1] <- 2
cps$educ[cps$master == 1] <- 3
# Multinomial probit model for education
f_educ <- educ ~ age + I(age ^ 2) + sbachelor + smaster
model1 <- mnprobit(f_educ, data = cps)
summary(model1)
# Endogenous education treatment model
f_lwage <- lwage ~ age + I(age ^ 2) + bachelor + master + health
model2 <- mnprobit(f_educ, f_lwage, data = cps, cov_type = "gop")
summary(model2)
# Endogenous education switching model
f_lwage2 <- lwage ~ age + I(age ^ 2) + health
model3 <- mnprobit(f_educ, f_lwage2, data = cps,
regimes = c(0, 1, 2), cov_type = "gop")
summary(model3)
# -------------------------------
# Simulated data example 1
# Multinomial probit model
# -------------------------------
# Load required package
library("mnorm")
# ---
# Step 1
# Simulation of data
# ---
# Set seed for reproducibility
set.seed(123)
# The number of observations
n <- 1000
# Regressors (covariates)
w1 <- runif(n = n, min = -1, max = 1)
w2 <- runif(n = n, min = -1, max = 1)
# Random errors
sigma.1 <- 1
sigma.2 <- 0.9
rho <- 0.7
sigma <- matrix(c(sigma.1 ^ 2, sigma.1 * sigma.2 * rho,
sigma.1 * sigma.2 * rho, sigma.2 ^ 2),
ncol = 2, byrow = TRUE)
u <- rmnorm(n = n, mean = c(0, 0), sigma = sigma)
# Coefficients
gamma.1 <- c(0.1, 2, 3)
gamma.2 <- c(-0.1, 3, -2)
# Linear index
z1.li <- gamma.1[1] + gamma.1[2] * w1 + gamma.1[3] * w2
z2.li <- gamma.2[1] + gamma.2[2] * w1 + gamma.2[3] * w2
# Latent variable
z1.star <- z1.li + u[, 1]
z2.star <- z2.li + u[, 2]
z3.star <- rep(0, n)
# Observable ordered outcome
z <- rep(3, n)
z[(z1.star > z2.star) & (z1.star > z3.star)] <- 1
z[(z2.star > z1.star) & (z2.star > z3.star)] <- 2
table(z)
# Data
data <- data.frame(w1 = w1, w2 = w2, z = z)
# ---
# Step 2
# Estimation of parameters
# ---
# Estimation
model <- mnprobit(z ~ w1 + w2,
data = data)
summary(model)
# Compare estimates and true values of parameters
# regression coefficients
cbind(true = gamma.1, estimate = model$coef[, 1])
cbind(true = gamma.2, estimate = model$coef[, 2])
# covariances
cbind(true = c(sigma[1, 2], sigma[2, 2]),
estimate = c(model$sigma[1, 2], model$sigma[2, 2]))
# ---
# Step 3
# Estimation of probabilities and marginal effects
# ---
# For every observation in a sample predict
# probability of 2-nd alternative i.e. P(z = 2)
prob <- predict(model, alt = 2, type = "prob")
head(prob)
# probability of each alternative
prob <- predict(model, alt = NULL, type = "prob")
head(prob)
# Calculate mean marginal effect of w2 on P(z = 1)
mean(predict(model, alt = 1, type = "prob", me = "w2"))
# Calculate probabilities and marginal effects
# for manually provided observations.
# new data
newdata <- data.frame(z = c(1, 1),
w1 = c(0.5, 0.2),
w2 = c(-0.3, 0.8))
# probability P(z = 2)
predict(model, alt = 2, type = "prob", newdata = newdata)
# linear index
predict(model, type = "li", newdata = newdata)
# marginal effect of w1 on P(z = 2)
predict(model, alt = 2, type = "prob", newdata = newdata, me = "w1")
# marginal effect of w1 and w2 on P(z = 3)
predict(model, alt = 3, type = "prob",
newdata = newdata, me = c("w1", "w2"))
# marginal effect of w2 on the linear index
predict(model, alt = 2, type = "li", newdata = newdata, me = "w2")
# discrete marginal effect i.e. P(z = 2 | w1 = 0.5) - P(z = 2 | w1 = 0.2)
predict(model, alt = 2, type = "prob", newdata = newdata,
me = "w2", eps = c(0.2, 0.5))
# adjusted conditional expectation for endogenous switching and
# sample selection models with continuous outcome with random error 'e'
# E(e | z = 2) / cov(e, u)
# where joint distribution of 'e' and 'u' determined by
# Gaussian copula and 'e' is normally distributed
predict(model, alt = 2, type = "lambda", newdata = newdata)
# -------------------------------
# Simulated data example 2
# Multinomial selection model
# -------------------------------
# Load required package
library("mnorm")
# ---
# Step 1
# Simulation of data
# ---
# Set seed for reproducibility
set.seed(123)
# The number of observations
n <- 1000
# Random errors
sd.z2 <- sqrt(0.9)
cor.z <- 0.3
sd.y0 <- sqrt(2)
cor.z1y0 <- 0.4
cor.z2y0 <- 0.7
sd.y1 <- sqrt(1.8)
cor.z1y1 <- 0.3
cor.z2y1 <- 0.6
cor.y <- 0.8
sigma <- matrix(c(
1, cor.z * sd.z2, cor.z1y0 * sd.y0, cor.z1y1 * sd.y1,
cor.z * sd.z2, sd.z2 ^ 2, cor.z2y0 * sd.z2 * sd.y0, cor.z2y1 * sd.z2 * sd.y1,
cor.z1y0 * sd.y0, cor.z2y0 * sd.z2 * sd.y0, sd.y0 ^ 2, cor.y * sd.y0 * sd.y1,
cor.z1y1 * sd.y1, cor.z2y1 * sd.z2 * sd.y1, cor.y * sd.y0 * sd.y1, sd.y1 ^ 2),
ncol = 4, byrow = TRUE)
colnames(sigma) <- c("z1", "z2", "y0", "y1")
rownames(sigma) <- colnames(sigma)
# Simulate random errors
errors <- rmnorm(n, c(0, 0, 0, 0), sigma)
u <- errors[, 1:2]
eps <- errors[, 3:4]
# Regressors (covariates)
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
x3 <- (x2 + runif(n, -1, 1)) / 2
W <- cbind(1, x1, x2)
X <- cbind(1, x1, x3)
# Coefficients
gamma <- cbind(c(0.1, 1, 1),
c(0.2, -1.2, 0.8))
beta <- cbind(c(1, -1, 2),
c(1, -2, 1))
# Linear indexes
z1.li <- W %*% gamma[, 1]
z2.li <- W %*% gamma[, 2]
y0.li <- X %*% beta[, 1]
y1.li <- X %*% beta[, 2]
# Latent variables
z1.star <- z1.li + u[, 1]
z2.star <- z2.li + u[, 2]
y0.star <- y0.li + eps[, 1]
y1.star <- y1.li + eps[, 2]
# Obvservable variable as a dummy
z1 <- (z1.star > z2.star) & (z1.star > 0)
z2 <- (z2.star > z1.star) & (z2.star > 0)
z3 <- (z1 != 1) & (z2 != 1)
# Aggregate observable variable
z <- rep(0, n)
z[z1] <- 1
z[z2] <- 2
z[z3] <- 3
table(z)
# Make unobservable values of continuous variable
y <- rep(Inf, n)
y[z == 1] <- y0.star[z == 1]
y[z == 2] <- y1.star[z == 2]
# Data
data <- data.frame(z = z, y = y,
x1 = x1, x2 = x2, x3 = x3)
# ---
# Step 2
# Estimation of parameters
# ---
# Maximum likelihood method
model.ml <- mnprobit(z ~ x1 + x2,
y ~ x1 + x3, regimes = c(0, 1, -1),
data = data, cov_type = "gop")
summary(model.ml)
# Two-step method
model.2step <- mnprobit(z ~ x1 + x2,
y ~ x1 + x3, regimes = c(0, 1, -1),
data = data, estimator = "2step")
summary(model.2step)
# Semiparametric estimator using 2-nd and 3-rd level polynomials
model.snp <- mnprobit(z ~ x1 + x2,
y ~ x1 + x3, regimes = c(0, 1, -1),
data = data, estimator = "2step",
degrees = c(2, 3))
summary(model.snp)
# Simple least squares as a benchmark
model.lm0 <- lm(y ~ x1 + x3, data = data[z == 1, ])
model.lm1 <- lm(y ~ x1 + x3, data = data[z == 2, ])
# Compare coefficients of continuous equations
# y0
cbind(true = beta[, 1],
ml = model.ml$coef2[, 1],
twostep = model.2step$coef2[, 1],
semiparametric = model.snp$coef2[, 1],
ls = coef(model.lm0))
# y1
cbind(true = beta[, 2],
ml = model.ml$coef2[, 2],
twostep = model.2step$coef2[, 2],
semiparametric = model.snp$coef2[, 2],
ls = coef(model.lm1))
# Compare coefficients of multinomial equations
# 1-nd alternative
cbind(true = gamma[, 1],
ml = model.ml$coef[, 1],
twostep = model.2step$coef[, 1])
# 2-nd alternative
cbind(true = gamma[, 2],
ml = model.ml$coef[, 2],
twostep = model.2step$coef[, 2])
# Compare variances of random errors associated with
# z2
cbind(true = sigma[2, 2], ml = model.ml$sigma[2, 2])
# y0
cbind(true = sd.y0 ^ 2, ml = model.ml$var2[1])
# y1
cbind(true = sd.y1 ^ 2, ml = model.ml$var2[2])
# compare covariances between
# z1 and z2
cbind(true = cor.z * sd.z2,
ml = model.ml$sigma[1, 2],
twostep = model.2step$sigma[1, 2])
# z1 and y0
cbind(true = cor.z1y0 * sd.y0,
ml = model.ml$cov2[1, 1],
twostep = model.2step$cov2[1, 1])
# z2 and y0
cbind(true = cor.z2y0 * sd.y0, ml = model.ml$cov2[2, 1])
# z1 and y1
cbind(true = cor.z1y1 * sd.y1, ml = model.ml$cov2[1, 2])
# z2 and y1
cbind(true = cor.z2y1 * sd.y1, ml = model.ml$cov2[2, 2])
# ---
# Step 3
# Predictions and marginal effects
# ---
# Unconditional expectation E(y1) for every observation in a sample
predict(model.ml, type = "val", regime = 1, alt = NULL)
# Marginal effect of x1 on conditional expectation E(y0|z = 2)
# for every observation in a sample
predict(model.ml, type = "val", regime = 0, alt = 2, me = "x1")
# Calculate predictions and marginal effects
# for manually provided observations
# using abovementioned models.
newdata <- data.frame(z = c(1, 1),
y = c(1, 1),
x1 = c(0.5, 0.2),
x2 = c(-0.3, 0.8),
x3 = c(0.6, -0.7))
# Unconditional expectation E(y0)
predict(model.ml, type = "val", regime = 0, alt = NULL, newdata = newdata)
predict(model.2step, type = "val", regime = 0, alt = NULL, newdata = newdata)
predict(model.snp, type = "val", regime = 0, alt = NULL, newdata = newdata)
# Conditional expectation E(y1|z=3)
predict(model.ml, type = "val", regime = 1, alt = 3, newdata = newdata)
predict(model.2step, type = "val", regime = 1, alt = 3, newdata = newdata)
predict(model.snp, type = "val", regime = 1, alt = 3, newdata = newdata)
# Marginal effect of x2 on E(y0|z = 1)
predict(model.ml, type = "val", regime = 0,
alt = 1, me = "x2", newdata = newdata)
predict(model.2step, type = "val", regime = 0,
alt = 1, me = "x2", newdata = newdata)
predict(model.snp, type = "val", regime = 0,
alt = 1, me = "x2", newdata = newdata)