kden {evmix} | R Documentation |
Kernel Density Estimation, With Variety of Kernels
Description
Density, cumulative distribution function, quantile function and
random number generation for the kernel density estimation using the kernel
specified by kernel
, with a constant bandwidth specified by either
lambda
or bw
.
Usage
dkden(x, kerncentres, lambda = NULL, bw = NULL, kernel = "gaussian",
log = FALSE)
pkden(q, kerncentres, lambda = NULL, bw = NULL, kernel = "gaussian",
lower.tail = TRUE)
qkden(p, kerncentres, lambda = NULL, bw = NULL, kernel = "gaussian",
lower.tail = TRUE)
rkden(n = 1, kerncentres, lambda = NULL, bw = NULL,
kernel = "gaussian")
Arguments
x |
quantiles |
kerncentres |
kernel centres (typically sample data vector or scalar) |
lambda |
bandwidth for kernel (as half-width of kernel) or |
bw |
bandwidth for kernel (as standard deviations of kernel) or |
kernel |
kernel name ( |
log |
logical, if TRUE then log density |
q |
quantiles |
lower.tail |
logical, if FALSE then upper tail probabilities |
p |
cumulative probabilities |
n |
sample size (positive integer) |
Details
Kernel density estimation using one of many possible kernels with a constant bandwidth.
The alternate bandwidth definitions are discussed in the
kernels
, with the lambda
as the default.
The bw
specification is the same as used in the
density
function.
The possible kernels are also defined in kernels
help
documentation with the "gaussian"
as the default choice.
The density function dkden
produces exactly the
same density estimate as density
when a sequence
of x
values are provided, see examples. The latter function is far
more efficient in this situation as it takes advantage of the computational
savings from doing the kernel smoothing in the spectral domain (using the FFT),
where the convolution becomes a multiplication. So even after accounting for applying
the (Fast) Fourier Transform (FFT) and its inverse it is much more efficient
especially for a large sample size or large number of evaluation points.
However, this KDE function applies the less efficient convolution using the standard definition:
\hat{f}_(x) = \frac{1}{n} \sum_{j=1}^{n} K(\frac{x - x_j}{\lambda})
where K(.)
is the density function for the standard
kernel. Thus are no restriction on the values x
can take. For example, in the
"gaussian"
kernel case for a particular x
the density is evaluated as
mean(dnorm(x, kerncentres, lambda))
for the density and
mean(pnorm(x, kerncentres, lambda))
for cumulative distribution
function which is slower than the FFT but is more adaptable.
An inversion sampler is used for random number generation which also rather inefficient, as it can be carried out more efficiently using a mixture representation.
The quantile function is rather complicated as there is no closed form solution,
so is obtained by numerical approximation of the inverse cumulative distribution function
P(X \le q) = p
to find q
. The quantile function
qkden
evaluates the KDE cumulative distribution
function over the range from c(max(kerncentre) - lambda, max(kerncentre) + lambda)
,
or c(max(kerncentre) - 5*lambda, max(kerncentre) + 5*lambda)
for normal kernel.
Outside of this range the quantiles are set to -Inf
for lower tail and Inf
for upper tail. A sequence of values
of length fifty times the number of kernels (with minimum of 1000) is first
calculated. Spline based interpolation using splinefun
,
with default monoH.FC
method, is then used to approximate the quantile
function. This is a similar approach to that taken
by Matt Wand in the qkde
in the ks
package.
If no bandwidth is provided lambda=NULL
and bw=NULL
then the normal
reference rule is used, using the bw.nrd0
function, which is
consistent with the density
function. At least two kernel
centres must be provided as the variance needs to be estimated.
Value
dkden
gives the density,
pkden
gives the cumulative distribution function,
qkden
gives the quantile function and
rkden
gives a random sample.
Acknowledgments
Based on code by Anna MacDonald produced for MATLAB.
Note
Unlike most of the other extreme value mixture model functions the
kden
functions have not been vectorised as
this is not appropriate. The main inputs (x
, p
or q
)
must be either a scalar or a vector, which also define the output length.
The kernel centres kerncentres
can either be a single datapoint or a vector
of data. The kernel centres (kerncentres
) and locations to evaluate density (x
)
and cumulative distribution function (q
) would usually be different.
Default values are provided for all inputs, except for the fundamentals
kerncentres
, x
, q
and p
. The default sample size for
rkden
is 1.
Missing (NA
) and Not-a-Number (NaN
) values in x
,
p
and q
are passed through as is and infinite values are set to
NA
. None of these are not permitted for the parameters.
Error checking of the inputs (e.g. invalid probabilities) is carried out and will either stop or give warning message as appropriate.
Author(s)
Yang Hu and Carl Scarrott carl.scarrott@canterbury.ac.nz.
References
http://en.wikipedia.org/wiki/Kernel_density_estimation
http://en.wikipedia.org/wiki/Cross-validation_(statistics)
Scarrott, C.J. and MacDonald, A. (2012). A review of extreme value threshold estimation and uncertainty quantification. REVSTAT - Statistical Journal 10(1), 33-59. Available from http://www.ine.pt/revstat/pdf/rs120102.pdf
Hu Y. and Scarrott, C.J. (2018). evmix: An R Package for Extreme Value Mixture Modeling, Threshold Estimation and Boundary Corrected Kernel Density Estimation. Journal of Statistical Software 84(5), 1-27. doi: 10.18637/jss.v084.i05.
Bowman, A.W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71(2), 353-360.
Duin, R.P.W. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers C25(11), 1175-1179.
MacDonald, A., Scarrott, C.J., Lee, D., Darlow, B., Reale, M. and Russell, G. (2011). A flexible extreme value mixture model. Computational Statistics and Data Analysis 55(6), 2137-2157.
Wand, M. and Jones, M.C. (1995). Kernel Smoothing. Chapman && Hall.
See Also
kernels
, kfun
,
density
, bw.nrd0
and dkde
in ks
package.
Other kden: bckden
, fbckden
,
fgkgcon
, fgkg
,
fkdengpdcon
, fkdengpd
,
fkden
, kdengpdcon
,
kdengpd
Other kdengpd: bckdengpd
,
fbckdengpd
, fgkg
,
fkdengpdcon
, fkdengpd
,
fkden
, gkg
,
kdengpdcon
, kdengpd
Other gkg: fgkgcon
, fgkg
,
fkdengpd
, gkgcon
,
gkg
, kdengpd
Other bckden: bckdengpdcon
,
bckdengpd
, bckden
,
fbckdengpdcon
, fbckdengpd
,
fbckden
, fkden
Other bckdengpd: bckdengpdcon
,
bckdengpd
, bckden
,
fbckdengpdcon
, fbckdengpd
,
fbckden
, fkdengpd
,
gkg
, kdengpd
Other fkden: fkden
Examples
## Not run:
set.seed(1)
par(mfrow = c(2, 2))
nk=50
x = rnorm(nk)
xx = seq(-5, 5, 0.01)
plot(xx, dnorm(xx))
rug(x)
for (i in 1:nk) lines(xx, dnorm(xx, x[i], sd = bw.nrd0(x))*0.05)
lines(xx, dkden(xx, x), lwd = 2, col = "red")
lines(density(x), lty = 2, lwd = 2, col = "green")
legend("topright", c("True Density", "KDE Using evmix", "KDE Using density function"),
lty = c(1, 1, 2), lwd = c(1, 2, 2), col = c("black", "red", "green"))
# Estimate bandwidth using cross-validation likelihood
x = rnorm(nk)
fit = fkden(x)
hist(x, nk/5, freq = FALSE, xlim = c(-5, 5), ylim = c(0, 0.6))
rug(x)
for (i in 1:nk) lines(xx, dnorm(xx, x[i], sd = fit$bw)*0.05)
lines(xx,dnorm(xx), col = "black")
lines(xx, dkden(xx, x, lambda = fit$lambda), lwd = 2, col = "red")
lines(density(x), lty = 2, lwd = 2, col = "green")
lines(density(x, bw = fit$bw), lwd = 2, lty = 2, col = "blue")
legend("topright", c("True Density", "KDE fitted evmix",
"KDE Using density, default bandwidth", "KDE Using density, c-v likelihood bandwidth"),
lty = c(1, 1, 2, 2), lwd = c(1, 2, 2, 2), col = c("black", "red", "green", "blue"))
plot(xx, pnorm(xx), type = "l")
rug(x)
lines(xx, pkden(xx, x), lwd = 2, col = "red")
lines(xx, pkden(xx, x, lambda = fit$lambda), lwd = 2, col = "green")
# green and blue (quantile) function should be same
p = seq(0, 1, 0.001)
lines(qkden(p, x, lambda = fit$lambda), p, lwd = 2, lty = 2, col = "blue")
legend("topleft", c("True Density", "KDE using evmix, normal reference rule",
"KDE using evmix, c-v likelihood","KDE quantile function, c-v likelihood"),
lty = c(1, 1, 1, 2), lwd = c(1, 2, 2, 2), col = c("black", "red", "green", "blue"))
xnew = rkden(10000, x, lambda = fit$lambda)
hist(xnew, breaks = 100, freq = FALSE, xlim = c(-5, 5))
rug(xnew)
lines(xx,dnorm(xx), col = "black")
lines(xx, dkden(xx, x), lwd = 2, col = "red")
legend("topright", c("True Density", "KDE Using evmix"),
lty = c(1, 2), lwd = c(1, 2), col = c("black", "red"))
## End(Not run)