fbckden {evmix} | R Documentation |
Cross-validation MLE Fitting of Boundary Corrected Kernel Density Estimation Using a Variety of Approaches
Description
Maximum likelihood estimation for fitting boundary corrected kernel density estimator using a variety of approaches (and many possible kernels), by treating it as a mixture model.
Usage
fbckden(x, linit = NULL, bwinit = NULL, kernel = "gaussian",
extracentres = NULL, bcmethod = "simple", proper = TRUE,
nn = "jf96", offset = NULL, xmax = NULL, add.jitter = FALSE,
factor = 0.1, amount = NULL, std.err = TRUE, method = "BFGS",
control = list(maxit = 10000), finitelik = TRUE, ...)
lbckden(x, lambda = NULL, bw = NULL, kernel = "gaussian",
extracentres = NULL, bcmethod = "simple", proper = TRUE,
nn = "jf96", offset = NULL, xmax = NULL, log = TRUE)
nlbckden(lambda, x, bw = NULL, kernel = "gaussian",
extracentres = NULL, bcmethod = "simple", proper = TRUE,
nn = "jf96", offset = NULL, xmax = NULL, finitelik = FALSE)
Arguments
x |
vector of sample data |
linit |
initial value for bandwidth (as kernel half-width) or |
bwinit |
initial value for bandwidth (as kernel standard deviations) or |
kernel |
kernel name ( |
extracentres |
extra kernel centres used in KDE,
but likelihood contribution not evaluated, or |
bcmethod |
boundary correction method |
proper |
logical, whether density is renormalised to integrate to unity (where needed) |
nn |
non-negativity correction method (simple boundary correction only) |
offset |
offset added to kernel centres (logtrans only) or |
xmax |
upper bound on support (copula and beta kernels only) or |
add.jitter |
logical, whether jitter is needed for rounded kernel centres |
factor |
see |
amount |
see |
std.err |
logical, should standard errors be calculated |
method |
optimisation method (see |
control |
optimisation control list (see |
finitelik |
logical, should log-likelihood return finite value for invalid parameters |
... |
optional inputs passed to |
lambda |
bandwidth for kernel (as half-width of kernel) or |
bw |
bandwidth for kernel (as standard deviations of kernel) or |
log |
logical, if |
Details
The boundary corrected kernel density estimator using a variety of approaches (and many possible kernels) is fitted to the entire dataset using cross-validation maximum likelihood estimation. The estimated bandwidth, variance and standard error are automatically output.
The log-likelihood and negative log-likelihood are also provided for wider
usage, e.g. constructing your own extreme value
mixture models or profile likelihood functions. The parameter
lambda
must be specified in the negative log-likelihood
nlbckden
.
Log-likelihood calculations are carried out in
lbckden
, which takes bandwidths as inputs in
the same form as distribution functions. The negative log-likelihood is a
wrapper for lbckden
, designed towards making
it useable for optimisation (e.g. lambda
given as first input).
The alternate bandwidth definitions are discussed in the
kernels
, with the lambda
used here but
bw
also output. The bw
specification is the same as used in the
density
function.
The possible kernels are also defined in kernels
help
documentation with the "gaussian"
as the default choice.
Unlike the standard KDE, there is no general rule-of-thumb bandwidth for all these estimators, with only certain methods having a guideline in the literature, so none have been implemented. Hence, a bandwidth must always be specified.
The simple
, renorm
, beta1
, beta2
gamma1
and gamma2
density estimates require renormalisation, achieved
by numerical integration, so is very time consuming.
Missing values (NA
and NaN
) are assumed to be invalid data so are ignored.
Cross-validation likelihood is used for kernel density component, obtained by leaving each point out in turn and evaluating the KDE at the point left out:
L(\lambda)\prod_{i=1}^{n} \hat{f}_{-i}(x_i)
where
\hat{f}_{-i}(x_i) = \frac{1}{(n-1)\lambda} \sum_{j=1: j\ne i}^{n} K(\frac{x_i - x_j}{\lambda})
is the KDE obtained when the i
th datapoint is dropped out and then
evaluated at that dropped datapoint at x_i
.
Normally for likelihood estimation of the bandwidth the kernel centres and
the data where the likelihood is evaluated are the same. However, when using
KDE for extreme value mixture modelling the likelihood only those data in the
bulk of the distribution should contribute to the likelihood, but all the
data (including those beyond the threshold) should contribute to the density
estimate. The extracentres
option allows the use to specify extra
kernel centres used in estimating the density, but not evaluated in the
likelihood. The default is to just use the existing data, so
extracentres=NULL
.
The default optimisation algorithm is "BFGS", which requires a finite negative
log-likelihood function evaluation finitelik=TRUE
. For invalid
parameters, a zero likelihood is replaced with exp(-1e6)
. The "BFGS"
optimisation algorithms require finite values for likelihood, so any user
input for finitelik
will be overridden and set to finitelik=TRUE
if either of these optimisation methods is chosen.
It will display a warning for non-zero convergence result comes from
optim
function call.
If the hessian is of reduced rank then the variance (from inverse hessian)
and standard error of bandwidth parameter cannot be calculated, then by default
std.err=TRUE
and the function will stop. If you want the bandwidth estimate
even if the hessian is of reduced rank (e.g. in a simulation study) then
set std.err=FALSE
.
Value
fbckden
gives leave one out cross-validation
(log-)likelihood and
lbckden
gives the negative log-likelihood.
nlbckden
returns a simple list with the following elements
call : | optim call |
x : | (jittered) data vector x |
kerncentres : actual kernel centres used x |
|
init : | linit for lambda |
optim : | complete optim output |
mle : | vector of MLE of bandwidth |
cov : | variance of MLE of bandwidth |
se : | standard error of MLE of bandwidth |
nllh : | minimum negative cross-validation log-likelihood |
n : | total sample size |
lambda : | MLE of lambda (kernel half-width) |
bw : | MLE of bw (kernel standard deviations) |
kernel : | kernel name |
bcmethod : | boundary correction method |
proper : | logical, whether renormalisation is requested |
nn : | non-negative correction method |
offset : | offset for log transformation method |
xmax : | maximum value of scale beta or copula |
The output list has some duplicate entries and repeats some of the inputs to both
provide similar items to those from fpot
and to make it
as useable as possible.
Warning
Two important practical issues arise with MLE for the kernel bandwidth:
1) Cross-validation likelihood is needed for the KDE bandwidth parameter
as the usual likelihood degenerates, so that the MLE \hat{\lambda} \rightarrow 0
as
n \rightarrow \infty
, thus giving a negative bias towards a small bandwidth.
Leave one out cross-validation essentially ensures that some smoothing between the kernel centres
is required (i.e. a non-zero bandwidth), otherwise the resultant density estimates would always
be zero if the bandwidth was zero.
This problem occassionally rears its ugly head for data which has been heavily rounded,
as even when using cross-validation the density can be non-zero even if the bandwidth is zero.
To overcome this issue an option to add a small jitter should be added to the data
(x
only) has been included in the fitting inputs, using the
jitter
function, to remove the ties. The default options red in the
jitter
are specified above, but the user can override these.
Notice the default scaling factor=0.1
, which is a tenth of the default value in the
jitter
function itself.
A warning message is given if the data appear to be rounded (i.e. more than 5 data rounding is the likely culprit. Only use the jittering when the MLE of the bandwidth is far too small.
2) For heavy tailed populations the bandwidth is positively biased, giving oversmoothing
(see example). The bias is due to the distance between the upper (or lower) order statistics not
necessarily decaying to zero as the sample size tends to infinity. Essentially, as the distance
between the two largest (or smallest) sample datapoints does not decay to zero, some smoothing between
them is required (i.e. bandwidth cannot be zero). One solution to this problem is to splice
the GPD at a suitable threshold to remove the problematic tail from the inference for the bandwidth,
using the fbckdengpd
function for a heavy upper tail. See MacDonald et al (2013).
Acknowledgments
Based on code by Anna MacDonald produced for MATLAB.
Note
An initial bandwidth must be provided, so linit
and bwinit
cannot both be NULL
The extra kernel centres extracentres
can either be a vector of data or NULL
.
Invalid parameter ranges will give 0
for likelihood, log(0)=-Inf
for
log-likelihood and -log(0)=Inf
for negative log-likelihood.
Infinite and missing sample values are dropped.
Error checking of the inputs is carried out and will either stop or give warning message as appropriate.
Author(s)
Yang Hu and Carl Scarrott carl.scarrott@canterbury.ac.nz.
References
http://en.wikipedia.org/wiki/Kernel_density_estimation
http://en.wikipedia.org/wiki/Cross-validation_(statistics)
Scarrott, C.J. and MacDonald, A. (2012). A review of extreme value threshold estimation and uncertainty quantification. REVSTAT - Statistical Journal 10(1), 33-59. Available from http://www.ine.pt/revstat/pdf/rs120102.pdf
Bowman, A.W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71(2), 353-360.
Duin, R.P.W. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers C25(11), 1175-1179.
MacDonald, A., Scarrott, C.J., Lee, D., Darlow, B., Reale, M. and Russell, G. (2011). A flexible extreme value mixture model. Computational Statistics and Data Analysis 55(6), 2137-2157.
MacDonald, A., C. J. Scarrott, and D. S. Lee (2011). Boundary correction, consistency and robustness of kernel densities using extreme value theory. Submitted. Available from: http://www.math.canterbury.ac.nz/~c.scarrott.
Wand, M. and Jones, M.C. (1995). Kernel Smoothing. Chapman && Hall.
See Also
kernels
, kfun
,
jitter
, density
and
bw.nrd0
Other kden: bckden
, fgkgcon
,
fgkg
, fkdengpdcon
,
fkdengpd
, fkden
,
kdengpdcon
, kdengpd
,
kden
Other bckden: bckdengpdcon
,
bckdengpd
, bckden
,
fbckdengpdcon
, fbckdengpd
,
fkden
, kden
Other bckdengpd: bckdengpdcon
,
bckdengpd
, bckden
,
fbckdengpdcon
, fbckdengpd
,
fkdengpd
, gkg
,
kdengpd
, kden
Other bckdengpdcon: bckdengpdcon
,
bckdengpd
, bckden
,
fbckdengpdcon
, fbckdengpd
,
fkdengpdcon
, gkgcon
,
kdengpdcon
Other fbckden: bckden
Examples
## Not run:
set.seed(1)
par(mfrow = c(1, 1))
nk=500
x = rgamma(nk, shape = 1, scale = 2)
xx = seq(-1, 10, 0.01)
# cut and normalize is very quick
fit = fbckden(x, linit = 0.2, bcmethod = "cutnorm")
hist(x, nk/5, freq = FALSE)
rug(x)
lines(xx, dgamma(xx, shape = 1, scale = 2), col = "black")
# but cut and normalize does not always work well for boundary correction
lines(xx, dbckden(xx, x, lambda = fit$lambda, bcmethod = "cutnorm"), lwd = 2, col = "red")
# Handily, the bandwidth usually works well for other approaches as well
lines(xx, dbckden(xx, x, lambda = fit$lambda, bcmethod = "simple"), lwd = 2, col = "blue")
lines(density(x), lty = 2, lwd = 2, col = "green")
legend("topright", c("True Density", "BC KDE using cutnorm",
"BC KDE using simple", "KDE Using density"),
lty = c(1, 1, 1, 2), lwd = c(1, 2, 2, 2), col = c("black", "red", "blue", "green"))
# By contrast simple boundary correction is very slow
# a crude trick to speed it up is to ignore the normalisation and non-negative correction,
# which generally leads to bandwidth being biased high
fit = fbckden(x, linit = 0.2, bcmethod = "simple", proper = FALSE, nn = "none")
hist(x, nk/5, freq = FALSE)
rug(x)
lines(xx, dgamma(xx, shape = 1, scale = 2), col = "black")
lines(xx, dbckden(xx, x, lambda = fit$lambda, bcmethod = "simple"), lwd = 2, col = "blue")
lines(density(x), lty = 2, lwd = 2, col = "green")
# but ignoring upper tail in likelihood works a lot better
q75 = qgamma(0.75, shape = 1, scale = 2)
fitnotail = fbckden(x[x <= q75], linit = 0.1,
bcmethod = "simple", proper = FALSE, nn = "none", extracentres = x[x > q75])
lines(xx, dbckden(xx, x, lambda = fitnotail$lambda, bcmethod = "simple"), lwd = 2, col = "red")
legend("topright", c("True Density", "BC KDE using simple", "BC KDE (upper tail ignored)",
"KDE Using density"),
lty = c(1, 1, 1, 2), lwd = c(1, 2, 2, 2), col = c("black", "blue", "red", "green"))
## End(Not run)