R: Semi-nonparametric maximum likelihood estimation

hpaML {hpa}

R Documentation

Semi-nonparametric maximum likelihood estimation

Description

This function performs semi-nonparametric (SNP) maximum likelihood estimation of unknown (possibly truncated) multivariate density using Hermite polynomial based approximating function proposed by Gallant and Nychka in 1987. Please, see dhpa 'Details' section to get more information concerning this approximating function.

Usage

hpaML(
  data,
  pol_degrees = numeric(0),
  tr_left = numeric(0),
  tr_right = numeric(0),
  given_ind = numeric(0),
  omit_ind = numeric(0),
  x0 = numeric(0),
  cov_type = "sandwich",
  boot_iter = 100L,
  is_parallel = FALSE,
  opt_type = "optim",
  opt_control = NULL,
  is_validation = TRUE
)

Arguments

`data`	numeric matrix which rows are realizations of independent identically distributed random vectors while columns correspond to variables.
`pol_degrees`	non-negative integer vector of polynomial degrees (orders).
`tr_left`	numeric vector of left (lower) truncation limits.
`tr_right`	numeric vector of right (upper) truncation limits.
`given_ind`	logical or numeric vector indicating whether corresponding random vector component is conditioned. By default it is a logical vector of `FALSE` values. If `give_ind[i]` equals `TRUE` or `i` then `i`-th column of `x` matrix will contain conditional values.
`omit_ind`	logical or numeric vector indicating whether corresponding random component is omitted. By default it is a logical vector of `FALSE` values. If `omit_ind[i]` equals `TRUE` or `i` then values in `i`-th column of `x` matrix will be ignored.
`x0`	numeric vector of optimization routine initial values. Note that `x0=c(pol_coefficients[-1], mean, sd)`. For `pol_coefficients`, `mean` and `sd` documentation see `dhpa` function.
`cov_type`	character determining the type of covariance matrix to be returned and used for summary. If `cov_type = "hessian"` then negative inverse of Hessian matrix will be applied. If `cov_type = "gop"` then inverse of Jacobian outer products will be used. If `cov_type = "sandwich"` (default) then sandwich covariance matrix estimator will be applied. If `cov_type = "bootstrap"` then bootstrap with `boot_iter` iterations will be used. If `cov_type = "hessianFD"` or `cov_type = "sandwichFD"` then (probably) more accurate but computationally demanding central difference Hessian approximation will be calculated for the inverse Hessian and sandwich estimators correspondingly. Central differences are computed via analytically provided gradient. This Hessian matrix estimation approach seems to be less accurate than BFGS approximation if polynomial order is high (usually greater then 5).
`boot_iter`	the number of bootstrap iterations for `cov_type = "bootstrap"` covariance matrix estimator type.
`is_parallel`	if `TRUE` then multiple cores will be used for some calculations. It usually provides speed advantage for large enough samples (about more than 1000 observations).
`opt_type`	string value determining the type of the optimization routine to be applied. The default is `"optim"` meaning that BFGS method from the `optim` function will be applied. If `opt_type = "GA"` then `ga` function will be additionally applied.
`opt_control`	a list containing arguments to be passed to the optimization routine depending on `opt_type` argument value. Please see details to get additional information.
`is_validation`	logical value indicating whether function input arguments should be validated. Set it to `FALSE` for slight performance boost (default value is `TRUE`).

Details

Densities Hermite polynomial approximation approach has been proposed by A. Gallant and D. W. Nychka in 1987. The main idea is to approximate unknown distribution density with scaled Hermite polynomial. For more information please refer to the literature listed below.

Let's use notations introduced in dhpa 'Details' section. Function hpaML maximizes the following quasi log-likelihood function:

\ln L(\alpha, \mu, \sigma; x) = \sum\limits_{i=1}^{n} \ln\left(f_{\xi}(x_{i};\alpha, \mu, \sigma)\right),

where (in addition to previously defined notations):

x_{i} - are observations i.e. data matrix rows.

n - is sample size i.e. the number of data matrix rows.

Arguments pol_degrees, tr_left, tr_right, given_ind and omit_ind affect the form of f_{\xi}\left(x_{i};\alpha, \mu, \sigma)\right) in a way described in dhpa 'Details' section. Note that change of given_ind and omit_ind values may result in estimator which statistical properties has not been rigorously investigated yet.

The first polynomial coefficient (zero powers) set to 1 for identification purposes i.e. \alpha_{(0,...,0)}=1.

All NA and NaN values will be removed from data matrix.

The function calculates standard errors via sandwich estimator and significance levels are reported taking into account quasi maximum likelihood estimator (QMLE) asymptotic normality. If one wants to switch from QMLE to semi-nonparametric estimator (SNPE) during hypothesis testing then covariance matrix should be estimated again using bootstrap.

This function maximizes (quasi) log-likelihood function via optim function setting its method argument to "BFGS". If opt_type = "GA" then genetic algorithm from ga function will be additionally (after optim putting its solution (par) into suggestions matrix) applied in order to perform global optimization. Note that global optimization takes much more time (usually minutes but sometimes hours or even days). The number of iterations and population size of the genetic algorithm will grow linearly along with the number of estimated parameters. If it seems that global maximum has not been found then it is possible to continue the search restarting the function setting its input argument x0 to x1 output value. Note that if cov_type = "bootstrap" then ga function will not be used for bootstrap iterations since it may be extremely time consuming.

If opt_type = "GA" then opt_control should be the list containing the values to be passed to ga function. It is possible to pass arguments lower, upper, popSize, pcrossover, pmutation, elitism, maxiter, suggestions, optim, optimArgs, seed and monitor. Note that it is possible to set population, selection, crossover and mutation arguments changing ga default parameters via gaControl function. These arguments information reported in ga. In order to provide manual values for lower and upper bounds please follow parameters ordering mentioned above for the x0 argument. If these bounds are not provided manually then they (except those related to the polynomial coefficients) will depend on the estimates obtained by local optimization via optim function (this estimates will be in the middle between lower and upper). Specifically for each sd parameter lower (upper) bound is 5 times lower (higher) than this parameter optim estimate. For each mean and regression coefficient parameter its lower and upper bounds deviate from corresponding optim estimate by two absolute values of this estimate. Finally, lower and upper bounds for each polynomial coefficient are -10 and 10 correspondingly (do not depend on their optim estimates).

The following arguments are differ from their defaults in ga:

pmutation = 0.2,
optim = TRUE,
optimArgs = list("method" = "Nelder-Mead", "poptim" = 0.2, "pressel" = 0.5),
seed = 8,
elitism = 2 + round(popSize * 0.1).

The arguments popSize and maxiter of ga function have been set proportional to the number of estimated polynomial coefficients:

popSize = 10 + (prod(pol_degrees + 1) - 1) * 2.
maxiter = 50 * (prod(pol_degrees + 1))

Value

This function returns an object of class "hpaML".

An object of class "hpaML" is a list containing the following components:

optim - optim function output. If opt_type = "GA" then it is the list containing optim and ga functions outputs.
x1 - numeric vector of distribution parameters estimates.
mean - density function mean vector estimate.
sd - density function sd vector estimate.
pol_coefficients - polynomial coefficients estimates.
tr_left - the same as tr_left input parameter.
tr_right - the same as tr_right input parameter.
omit_ind - the same as omit_ind input parameter.
given_ind - the same as given_ind input parameter.
cov_mat - covariance matrix estimate.
results - numeric matrix representing estimation results.
log-likelihood - value of Log-Likelihood function.
AIC - AIC value.
data - the same as data input parameter but without NA observations.
n_obs - number of observations.
bootstrap - list where bootstrap estimation results are stored.

References

A. Gallant and D. W. Nychka (1987) <doi:10.2307/1913241>

Examples


## Approximate Student (t) distribution

# Set seed for reproducibility
set.seed(123)

# Simulate 5000 realizations of Student distribution 
# with 5 degrees of freedom
n <- 5000
df <- 5
x <- matrix(rt(n, df), ncol = 1)
pol_degrees <- c(4)

# Apply pseudo maximum likelihood routine
ml_result <- hpa::hpaML(data = x, pol_degrees = pol_degrees)
summary(ml_result)

# Get predicted probabilites (density values) approximations
predict(ml_result)

# Plot density approximation
plot(ml_result)

## Approximate chi-squared distribution

# Set seed for reproducibility
set.seed(123)

# Simulate 5000 realizations of chi-squared distribution 
# with 5 degrees of freedom

n <- 5000
df <- 5
x <- matrix(rchisq(n, df), ncol = 1)
pol_degrees <- c(5)

# Apply pseudo maximum likelihood routine
ml_result <- hpaML(data = x, pol_degrees = as.vector(pol_degrees), 
				tr_left = 0)
summary(ml_result)

# Get predicted probabilites (density values) approximations
predict(ml_result)

# Plot density approximation
plot(ml_result)

## Approximate multivariate Student (t) distribution
## Note that calculations may take up to a minute

# Set seed for reproducibility
set.seed(123)

# Simulate 5000 realizations of three dimensional Student distribution 
# with 5 degrees of freedom
library("mvtnorm")
cov_mat <- matrix(c(1, 0.5, -0.5, 0.5, 1, 0.5, -0.5, 0.5, 1), ncol = 3)
x <- rmvt(n = 5000, sigma = cov_mat, df = 5)

# Estimate approximating joint distribution parameters
ml_result <- hpaML(data = x, pol_degrees = c(1, 1, 1))

# Get summary
summary(ml_result)

# Get predicted values for joint density function
predict(ml_result)

# Plot density approximation for the
# second random variable
plot(ml_result, ind = 2)

# Plot density approximation for the
# second random variable conditioning
# on x1 = 1
plot(ml_result, ind = 2, given = c(1, NA, NA))

## Approximate Student (t) distribution and plot densities approximated
## under different hermite polynomial degrees against 
## true density (of Student distribution)

# Simulate 5000 realizations of t-distribution with 5 degrees of freedom
n <- 5000
df <- 5
x <- matrix(rt(n, df), ncol=1)

# Apply pseudo maximum likelihood routine
# Create matrix of lists where i-th element contains hpaML results for K=i
ml_result <- matrix(list(), 4, 1)
for(i in 1:4)
{
 ml_result[[i]] <- hpa::hpaML(data = x, pol_degrees = i)
}

# Generate test values
test_values <- seq(qt(0.001, df), qt(0.999, df), 0.001)
n0 <- length(test_values)

# t-distribution density function at test values points
true_pred <- dt(test_values, df)

# Create matrix of lists where i-th element contains 
# densities predictions for K=i
PGN_pred <- matrix(list(), 4, 1)
for(i in 1:4)
{
  PGN_pred[[i]] <- predict(object = ml_result[[i]], 
                           newdata = matrix(test_values, ncol=1))
}
# Plot the result
library("ggplot2")

# prepare the data
h <- data.frame("values" = rep(test_values,5),
                "predictions" = c(PGN_pred[[1]],PGN_pred[[2]],
                                  PGN_pred[[3]],PGN_pred[[4]],
                                  true_pred), 
                "Density" = c(
                  rep("K=1",n0), rep("K=2",n0),
                  rep("K=3",n0), rep("K=4",n0),
                  rep("t-distribution",n0))
                  )
                  
# build the plot
ggplot(h, aes(values, predictions)) + geom_point(aes(color = Density)) +
  theme_minimal() + theme(legend.position = "top", 
                          text = element_text(size=26),
                          legend.title=element_text(size=20), 
                          legend.text=element_text(size=28)) +
  guides(colour = guide_legend(override.aes = list(size=10))
  )

# Get informative estimates summary for K=4
summary(ml_result[[4]])

[Package hpa version 1.3.3 Index]