hpaML {hpa} | R Documentation |
Semi-nonparametric maximum likelihood estimation
Description
This function performs semi-nonparametric (SNP)
maximum likelihood estimation of unknown (possibly truncated) multivariate
density using Hermite polynomial based approximating function proposed by
Gallant and Nychka in 1987. Please, see dhpa
'Details'
section to get more information concerning this approximating function.
Usage
hpaML(
data,
pol_degrees = numeric(0),
tr_left = numeric(0),
tr_right = numeric(0),
given_ind = numeric(0),
omit_ind = numeric(0),
x0 = numeric(0),
cov_type = "sandwich",
boot_iter = 100L,
is_parallel = FALSE,
opt_type = "optim",
opt_control = NULL,
is_validation = TRUE
)
Arguments
data |
numeric matrix which rows are realizations of independent identically distributed random vectors while columns correspond to variables. |
pol_degrees |
non-negative integer vector of polynomial degrees (orders). |
tr_left |
numeric vector of left (lower) truncation limits. |
tr_right |
numeric vector of right (upper) truncation limits. |
given_ind |
logical or numeric vector indicating whether corresponding
random vector component is conditioned. By default it is a logical
vector of |
omit_ind |
logical or numeric vector indicating whether corresponding
random component is omitted. By default it is a logical vector
of |
x0 |
numeric vector of optimization routine initial values.
Note that |
cov_type |
character determining the type of covariance matrix to be
returned and used for summary. If |
boot_iter |
the number of bootstrap iterations
for |
is_parallel |
if |
opt_type |
string value determining the type of the optimization
routine to be applied. The default is |
opt_control |
a list containing arguments to be passed to the
optimization routine depending on |
is_validation |
logical value indicating whether function input
arguments should be validated. Set it to |
Details
Densities Hermite polynomial approximation approach has been proposed by A. Gallant and D. W. Nychka in 1987. The main idea is to approximate unknown distribution density with scaled Hermite polynomial. For more information please refer to the literature listed below.
Let's use notations introduced in dhpa
'Details'
section. Function hpaML
maximizes the following
quasi log-likelihood function:
\ln L(\alpha, \mu, \sigma; x) = \sum\limits_{i=1}^{n}
\ln\left(f_{\xi}(x_{i};\alpha, \mu, \sigma)\right),
where (in addition to previously defined notations):
x_{i}
- are observations i.e. data
matrix rows.
n
- is sample size i.e. the number of data
matrix rows.
Arguments pol_degrees
, tr_left
, tr_right
,
given_ind
and omit_ind
affect the form of
f_{\xi}\left(x_{i};\alpha, \mu, \sigma)\right)
in a way described in
dhpa
'Details' section. Note that change of
given_ind
and omit_ind
values may result in estimator which
statistical properties has not been rigorously investigated yet.
The first polynomial coefficient (zero powers)
set to 1 for identification purposes i.e. \alpha_{(0,...,0)}=1
.
All NA
and NaN
values will be removed from data
matrix.
The function calculates standard errors via sandwich estimator and significance levels are reported taking into account quasi maximum likelihood estimator (QMLE) asymptotic normality. If one wants to switch from QMLE to semi-nonparametric estimator (SNPE) during hypothesis testing then covariance matrix should be estimated again using bootstrap.
This function maximizes (quasi) log-likelihood function
via optim
function setting its method
argument to "BFGS". If opt_type = "GA"
then genetic
algorithm from ga
function
will be additionally (after optim
putting its
solution (par
) into suggestions
matrix) applied in order to
perform global optimization. Note that global optimization takes
much more time (usually minutes but sometimes hours or even days).
The number of iterations and population size of the genetic algorithm
will grow linearly along with the number of estimated parameters.
If it seems that global maximum has not been found then it
is possible to continue the search restarting the function setting
its input argument x0
to x1
output value. Note that
if cov_type = "bootstrap"
then ga
function will not be used for bootstrap iterations since it
may be extremely time consuming.
If opt_type = "GA"
then opt_control
should be the
list containing the values to be passed to ga
function. It is possible to pass arguments lower
, upper
,
popSize
, pcrossover
, pmutation
, elitism
,
maxiter
, suggestions
, optim
, optimArgs
,
seed
and monitor
.
Note that it is possible to set population
,
selection
, crossover
and mutation
arguments changing
ga
default parameters via gaControl
function. These arguments information reported in ga
.
In order to provide manual values for lower
and upper
bounds
please follow parameters ordering mentioned above for the
x0
argument. If these bounds are not provided manually then
they (except those related to the polynomial coefficients)
will depend on the estimates obtained
by local optimization via optim
function
(this estimates will be in the middle
between lower
and upper
).
Specifically for each sd parameter lower
(upper
) bound
is 5 times lower (higher) than this
parameter optim
estimate.
For each mean and regression coefficient parameter its lower and
upper bounds deviate from corresponding optim
estimate
by two absolute values of this estimate.
Finally, lower and upper bounds for each polynomial
coefficient are -10
and 10
correspondingly (do not depend
on their optim
estimates).
The following arguments are differ from their defaults in
ga
:
-
pmutation = 0.2
, -
optim = TRUE
, -
optimArgs = list("method" = "Nelder-Mead", "poptim" = 0.2, "pressel" = 0.5)
, -
seed = 8
, -
elitism = 2 + round(popSize * 0.1)
.
The arguments popSize
and maxiter
of
ga
function have been set proportional to the number of
estimated polynomial coefficients:
-
popSize = 10 + (prod(pol_degrees + 1) - 1) * 2
. -
maxiter = 50 * (prod(pol_degrees + 1))
Value
This function returns an object of class "hpaML".
An object of class "hpaML" is a list containing the following components:
-
optim
-optim
function output. Ifopt_type = "GA"
then it is the list containingoptim
andga
functions outputs. -
x1
- numeric vector of distribution parameters estimates. -
mean
- density function mean vector estimate. -
sd
- density function sd vector estimate. -
pol_coefficients
- polynomial coefficients estimates. -
tr_left
- the same astr_left
input parameter. -
tr_right
- the same astr_right
input parameter. -
omit_ind
- the same asomit_ind
input parameter. -
given_ind
- the same asgiven_ind
input parameter. -
cov_mat
- covariance matrix estimate. -
results
- numeric matrix representing estimation results. -
log-likelihood
- value of Log-Likelihood function. -
AIC
- AIC value. -
data
- the same asdata
input parameter but withoutNA
observations. -
n_obs
- number of observations. -
bootstrap
- list where bootstrap estimation results are stored.
References
A. Gallant and D. W. Nychka (1987) <doi:10.2307/1913241>
See Also
summary.hpaML, predict.hpaML, logLik.hpaML, plot.hpaML
Examples
## Approximate Student (t) distribution
# Set seed for reproducibility
set.seed(123)
# Simulate 5000 realizations of Student distribution
# with 5 degrees of freedom
n <- 5000
df <- 5
x <- matrix(rt(n, df), ncol = 1)
pol_degrees <- c(4)
# Apply pseudo maximum likelihood routine
ml_result <- hpa::hpaML(data = x, pol_degrees = pol_degrees)
summary(ml_result)
# Get predicted probabilites (density values) approximations
predict(ml_result)
# Plot density approximation
plot(ml_result)
## Approximate chi-squared distribution
# Set seed for reproducibility
set.seed(123)
# Simulate 5000 realizations of chi-squared distribution
# with 5 degrees of freedom
n <- 5000
df <- 5
x <- matrix(rchisq(n, df), ncol = 1)
pol_degrees <- c(5)
# Apply pseudo maximum likelihood routine
ml_result <- hpaML(data = x, pol_degrees = as.vector(pol_degrees),
tr_left = 0)
summary(ml_result)
# Get predicted probabilites (density values) approximations
predict(ml_result)
# Plot density approximation
plot(ml_result)
## Approximate multivariate Student (t) distribution
## Note that calculations may take up to a minute
# Set seed for reproducibility
set.seed(123)
# Simulate 5000 realizations of three dimensional Student distribution
# with 5 degrees of freedom
library("mvtnorm")
cov_mat <- matrix(c(1, 0.5, -0.5, 0.5, 1, 0.5, -0.5, 0.5, 1), ncol = 3)
x <- rmvt(n = 5000, sigma = cov_mat, df = 5)
# Estimate approximating joint distribution parameters
ml_result <- hpaML(data = x, pol_degrees = c(1, 1, 1))
# Get summary
summary(ml_result)
# Get predicted values for joint density function
predict(ml_result)
# Plot density approximation for the
# second random variable
plot(ml_result, ind = 2)
# Plot density approximation for the
# second random variable conditioning
# on x1 = 1
plot(ml_result, ind = 2, given = c(1, NA, NA))
## Approximate Student (t) distribution and plot densities approximated
## under different hermite polynomial degrees against
## true density (of Student distribution)
# Simulate 5000 realizations of t-distribution with 5 degrees of freedom
n <- 5000
df <- 5
x <- matrix(rt(n, df), ncol=1)
# Apply pseudo maximum likelihood routine
# Create matrix of lists where i-th element contains hpaML results for K=i
ml_result <- matrix(list(), 4, 1)
for(i in 1:4)
{
ml_result[[i]] <- hpa::hpaML(data = x, pol_degrees = i)
}
# Generate test values
test_values <- seq(qt(0.001, df), qt(0.999, df), 0.001)
n0 <- length(test_values)
# t-distribution density function at test values points
true_pred <- dt(test_values, df)
# Create matrix of lists where i-th element contains
# densities predictions for K=i
PGN_pred <- matrix(list(), 4, 1)
for(i in 1:4)
{
PGN_pred[[i]] <- predict(object = ml_result[[i]],
newdata = matrix(test_values, ncol=1))
}
# Plot the result
library("ggplot2")
# prepare the data
h <- data.frame("values" = rep(test_values,5),
"predictions" = c(PGN_pred[[1]],PGN_pred[[2]],
PGN_pred[[3]],PGN_pred[[4]],
true_pred),
"Density" = c(
rep("K=1",n0), rep("K=2",n0),
rep("K=3",n0), rep("K=4",n0),
rep("t-distribution",n0))
)
# build the plot
ggplot(h, aes(values, predictions)) + geom_point(aes(color = Density)) +
theme_minimal() + theme(legend.position = "top",
text = element_text(size=26),
legend.title=element_text(size=20),
legend.text=element_text(size=28)) +
guides(colour = guide_legend(override.aes = list(size=10))
)
# Get informative estimates summary for K=4
summary(ml_result[[4]])