R: Estimate a Binary Choice Model

estim.bin {ldt}

R Documentation

Estimate a Binary Choice Model

Description

Use this function to estimate a binary choice model.

Usage

estim.bin(
  data,
  linkFunc = c("logit", "probit"),
  pcaOptionsX = NULL,
  costMatrices = NULL,
  optimOptions = get.options.newton(),
  aucOptions = get.options.roc(),
  simFixSize = 0,
  simTrainFixSize = 0,
  simTrainRatio = 0.75,
  simSeed = 0,
  weightedEval = FALSE,
  simMaxConditionNumber = Inf
)

Arguments

`data`	A list containing the data and other information required for estimation. Use the `get.data()` function to generate it from a `matrix` or a `data.frame`.
`linkFunc`	A character string that specifies the probability assumption (the link function). It can be `"logit"` or `"probit"`.
`pcaOptionsX`	A list of options for using principal components of the explanatory variables (`x`) instead of their actual values. Set to `NULL` to disable. Use `get.options.pca()` for initialization.
`costMatrices`	A list of numeric matrices where each one determines how to score the calculated probabilities. See `search.bin()` for more information and initialization.
`optimOptions`	A list of options for Newton optimization. Use `get.options.newton()` to get the options.
`aucOptions`	A list of options for AUC calculation. See `get.options.roc()` for more information and initialization.
`simFixSize`	An integer that determines the number of out-of-sample simulations. Use zero to disable the simulation.
`simTrainFixSize`	An integer representing the number of data points in the training sample in the out-of-sample simulation. If zero, `simTrainRatio` is used.
`simTrainRatio`	A number representing the size of the training sample relative to the available size in the out-of-sample simulation. It is effective only if `simTrainFixSize` is zero.
`simSeed`	A seed for the random number generator. Use zero for a random value.
`weightedEval`	If `TRUE`, weights will be used in evaluations.
`simMaxConditionNumber`	A number that sets the maximum allowed condition number in the simulation.
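
The interaction between `simTrainFixSize` and `simTrainRatio` can be illustrated with a small sketch. The helper `train_size()` below is hypothetical and is not part of ldt. It only mirrors the rule stated in the argument descriptions; the rounding and the cap at the sample size are assumptions made for illustration.

```r
# Hypothetical helper (not part of ldt): training-sample size implied by
# 'simTrainFixSize' and 'simTrainRatio' for a sample of 'n' observations.
train_size <- function(n, simTrainFixSize = 0, simTrainRatio = 0.75) {
  if (simTrainFixSize > 0) {
    # A positive fixed size takes precedence (capping at n is an assumption).
    return(min(simTrainFixSize, n))
  }
  # Otherwise the ratio is used (rounding down is an assumption).
  floor(simTrainRatio * n)
}

print(train_size(100))                        # ratio is used
print(train_size(100, simTrainFixSize = 60))  # fixed size is used
```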

Details

As documented in chapter 12 in Greene and Hensher (2010), binary regression is a statistical technique used to estimate the probability of one of two possible outcomes for a variable such as y, i.e., p=P(y=1) and q=P(y=0). The most commonly used binary regression models are the logit and probit models. In general, a binary regression model can be written as f(p) = z'\gamma+v, where the first element in \gamma is the intercept and f(p) is a link function. For logit and probit models we have f(p) = \ln{\frac{p}{1-p}} and f(p) = \Phi^{-1}(p) respectively, where \Phi^{-1} is the inverse cumulative distribution function of the standard normal distribution.
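
The two link functions above are available in base R, independently of ldt. The following minimal sketch evaluates them with `qlogis()` and `qnorm()` and checks that the inverse links `plogis()` and `pnorm()` recover the original probabilities. The probability values are arbitrary and chosen only for illustration.

```r
# Logit and probit links from base R's 'stats' package (not ldt-specific).
p <- c(0.1, 0.5, 0.9)        # arbitrary probabilities for illustration

logit_link  <- qlogis(p)     # ln(p / (1 - p))
probit_link <- qnorm(p)      # inverse standard normal CDF

# The inverse links map the linear index back to a probability.
stopifnot(isTRUE(all.equal(plogis(logit_link), p)))
stopifnot(isTRUE(all.equal(pnorm(probit_link), p)))

print(rbind(p = p, logit = logit_link, probit = probit_link))
```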

Given an independent sample of length N, the parameters of the binary regression model are estimated using maximum likelihood estimation. Assuming that some observations are more reliable or informative than others and w_i for i=1,\ldots,N reflects this fact, the likelihood function is given by:

L(\gamma) = \prod_{i=1}^N (p_i)^{w_i y_i} (1-p_i)^{w_i (1-y_i)},

where p_i=\frac{\exp(z_i'\gamma)}{1+\exp(z_i'\gamma)} for the logit model and p_i=\Phi(z_i'\gamma) for the probit model. ldt uses feasible GLS to calculate the initial value of the coefficients and a weighted least squares estimator to calculate the initial variance matrix of the error terms (see page 781 in Greene (2020)). The condition number of the estimation is calculated by multiplying the 1-norm of the observed information matrix at the maximum likelihood estimator by the 1-norm of its inverse (e.g., see page 94 in Trefethen and Bau (1997)). Furthermore, if x is a new observation of the explanatory variables, the predicted probability of the positive class is estimated by p=\frac{\exp(x'\gamma)}{1+\exp(x'\gamma)} for the logit model and p=\Phi(x'\gamma) for the probit model.
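
The weighted likelihood, the condition number, and the prediction formula can be sketched in base R. This is an illustration under stated assumptions, not ldt's internal code: the data, the coefficients in `gamma_true`, and the integer weights in `w` are simulated and hypothetical, and the maximization uses `optim()` rather than the Newton routine configured by `optimOptions`. The result is cross-checked against `glm()`, which maximizes the same weighted likelihood.

```r
# Sketch in base R (not ldt's internal code): weighted logit MLE, the 1-norm
# condition number of the observed information matrix, and a prediction.
set.seed(1)
n <- 200
z <- cbind(1, rnorm(n), rnorm(n))        # first column is the intercept
gamma_true <- c(-0.5, 1, -1)             # hypothetical coefficients
y <- rbinom(n, 1, plogis(drop(z %*% gamma_true)))
w <- sample(1:3, n, replace = TRUE)      # hypothetical integer weights

# Negative weighted log-likelihood:
#   -sum_i w_i * [y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i)]
neg_loglik <- function(g) {
  eta <- drop(z %*% g)
  -sum(w * (y * plogis(eta, log.p = TRUE) +
            (1 - y) * plogis(-eta, log.p = TRUE)))
}

fit <- optim(rep(0, 3), neg_loglik, method = "BFGS")
gamma_hat <- fit$par

# Cross-check against glm() with the same weights.
glm_fit <- glm(y ~ z - 1, family = binomial(), weights = w)
print(cbind(optim = gamma_hat, glm = unname(coef(glm_fit))))

# Observed information for the logit model: Z' diag(w_i p_i (1 - p_i)) Z.
p_hat <- drop(plogis(z %*% gamma_hat))
info <- crossprod(z, z * (w * p_hat * (1 - p_hat)))

# Condition number: 1-norm of the matrix times 1-norm of its inverse.
cond_number <- norm(info, type = "1") * norm(solve(info), type = "1")
print(cond_number)

# Predicted probability of the positive class for a new observation x.
x_new <- c(1, 0.3, -0.2)                 # includes the intercept term
p_new <- plogis(sum(gamma_hat * x_new))
print(p_new)
```

For the probit model, the same sketch applies with `pnorm()` in place of `plogis()` in the likelihood and the prediction, although the observed information matrix then takes a different form.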

Note that the focus of ldt is model uncertainty, and the main purpose of exporting this function is to show the inner calculations of the search process in the `search.bin()` function.

References

Greene WH (2020). Econometric analysis, 8th edition. Pearson Education Limited, New York. ISBN 9781292231136.

Greene WH, Hensher DA (2010). Modeling ordered choices: A primer. Cambridge University Press. ISBN 9780511845062, doi:10.1017/cbo9780511845062.

Trefethen LN, Bau D (1997). Numerical linear algebra. Society for Industrial and Applied Mathematics. ISBN 9780898714876.

Examples

# Example 1 (simulation, small model):
set.seed(123)
sample <- sim.bin(3L, 100)
print(sample$coef)

data <- data.frame(sample$y, sample$x)

#   Estimate using glm
fit <- glm(Y ~ X1 + X2, data = data, family = binomial())
print(fit)

#   Estimate using 'ldt::estim.bin'
fit <- estim.bin(data = get.data(data = data,
                                 equations = list(Y ~ X1 + X2)),
                  linkFunc = "logit")
print(fit)
plot_data <- plot(fit, type = 1)
#   See 'plot.ldt.estim()' function documentation


# Example 2 (simulation, large model with PCA analysis):
sample <- sim.bin(30L, 100, probit = TRUE)
data <- data.frame(sample$y, sample$x)
colnames(data) <- c(colnames(sample$y),colnames(sample$x))
pca_options <- get.options.pca(ignoreFirst = 1, exactCount = 3)
fit <- estim.bin(data = get.data(cbind(sample$y, sample$x),
                                  endogenous = ncol(sample$y),
                                  addIntercept = FALSE),
                  linkFunc = "probit",
                  pcaOptionsX = pca_options)
print(fit)
plot_data <- plot(fit, type = 2)

[Package ldt version 0.5.3 Index]