npuniden.boundary {np} | R Documentation |
Kernel Bounded Univariate Density Estimation Via Boundary Kernel Functions
Description
npuniden.boundary
computes kernel univariate unconditional
density estimates given a vector of continuously distributed training
data and, optionally, a bandwidth (otherwise least squares
cross-validation is used for its selection). Lower and upper bounds
[a
,b
] can be supplied (default is the empirical support
[min(X),max(X)]
) and if a
is set to -Inf
there is only one bound on the right, while if
b
is set to Inf
there is only one bound on the left. If
a
is set to -Inf
and b
to Inf
and the
Gaussian type 1 kernel function is used, this will deliver the
standard unadjusted kernel density estimate.
Usage
npuniden.boundary(X = NULL,
Y = NULL,
h = NULL,
a = min(X),
b = max(X),
bwmethod = c("cv.ls","cv.ml"),
cv = c("grid-hybrid","numeric"),
grid = NULL,
kertype = c("gaussian1","gaussian2",
"beta1","beta2",
"fb","fbl","fbu",
"rigaussian","gamma"),
nmulti = 5,
proper = FALSE)
Arguments
X |
a required numeric vector of training data lying in |
Y |
an optional numeric vector of evaluation data lying in |
h |
an optional bandwidth (>0) |
a |
an optional lower bound (defaults to lower bound of empirical support |
b |
an optional upper bound (defaults to upper bound of empirical support |
bwmethod |
whether to conduct bandwidth search via least squares cross-validation
( |
cv |
an optional argument for search (default is likely more reliable in the presence of local maxima) |
grid |
an optional grid used for the initial grid search when |
kertype |
an optional kernel specification (defaults to "gaussian1") |
nmulti |
number of multi-starts used when |
proper |
an optional logical value indicating whether to enforce proper density
and distribution function estimates over the range |
Details
Typical usages are (see below for a complete list of options and also the examples at the end of this help file)
model <- npuniden.boundary(X,a=-2,b=3)
npuniden.boundary
implements a variety of methods for
estimating a univariate density function defined over a continuous
random variable in the presence of bounds via the use of so-called
boundary or edge kernel functions.
The kernel functions "beta1"
and "beta2"
are Chen's
(1999) type 1 and 2 kernel functions with biases of O(h)
, the
"gamma"
kernel function is from Chen (2000) with a bias of
O(h)
, "rigaussian"
is the reciprocal inverse Gaussian
kernel function (Scaillet (2004), Igarashi & Kakizawa (2014)) with
bias of O(h)
, and "gaussian1"
and "gaussian2"
are
truncated Gaussian kernel functions with biases of O(h)
and
O(h^2)
, respectively. The kernel functions "fb"
,
"fbl"
and "fbu"
are floating boundary polynomial
biweight kernels with biases of O(h^2)
(Scott (1992), Page
146). Without exception, these kernel functions are asymmetric in
general with shape that changes depending on where the density is
being estimated (i.e., how close the estimation point x
in
\hat f(x)
is to a boundary). This function is written purely in
R, so to see the exact form for each of these kernel functions, simply
enter the name of this function in R (i.e., enter
npuniden.boundary
after loading this package) and scroll up for
their definitions.
The kernel functions "gamma"
, "rigaussian"
, and
"fbl"
have support [a,\infty]
. The kernel function
"fbu"
has support [-\infty,b]
. The rest have support on
[a,b]
. Note that the two sided support default values are
a=0
and b=1
. When estimating a variable on
[0,\infty)
the default lower bound can be used but when
estimating a variable on (-\infty,0]
you must manually set the
upper bound to b=0
.
Note that data-driven bandwidth selection is more nuanced in bounded
settings, therefore it would be prudent to manually select a bandwidth
that is, say, 1/25th of the range of the data and manually inspect the
estimate (say h=0.05
when X\in [0,1]
). Also, it may be
wise to compare the density estimate with that from a histogram with
the option breaks=25
. Note also that the kernel functions
"gaussian2"
, "fb"
, "fbl"
and "fbu"
can
assume negative values leading to potentially negative density
estimates, and must be trimmed when conducting likelihood
cross-validation which can lead to oversmoothing. Least squares
cross-validation is unaffected and appears to be more reliable in such
instances hence is the default here.
Scott (1992, Page 149) writes “While boundary kernels can be very useful, there are potentially serious problems with real data. There are an infinite number of boundary kernels reflecting the spectrum of possible design constraints, and these kernels are not interchangeable. Severe artifacts can be introduced by any one of them in inappropriate situations. Very careful examination is required to avoid being victimized by the particular boundary kernel chosen. Artifacts can unfortunately be introduced by the choice of the support interval for the boundary kernel.”
Note that since some kernel functions can assume negative values, this
can lead to improper density estimates. The estimated distribution
function is obtained via numerical integration of the estimated
density function and may itself not be proper even when evaluated on
the full range of the data [a,b]
. Setting the option
proper=TRUE
will render the density and distribution estimates
proper over the full range of the data, though this may not in
general be a mean square error optimal strategy.
Finally, note that this function is pretty bare-bones relative to other functions in this package. For one, at this time there is no automatic print support so kindly see the examples for illustrations of its use, among other differences.
Value
npuniden.boundary
returns the following components:
f |
estimated density at the points X |
F |
estimated distribution at the points X (numeric integral of f) |
sd.f |
asymptotic standard error of the estimated density at the points X |
sd.F |
asymptotic standard error of the estimated distribution at the points X |
h |
bandwidth used |
nmulti |
number of multi-starts used |
Author(s)
Jeffrey S. Racine racinej@mcmaster.ca
References
Bouezmarni, T. and Rolin, J.-M. (2003). “Consistency of the beta kernel density function estimator,” The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 31(1):89-98.
Chen, S. X. (1999). “Beta kernel estimators for density functions,” Computational Statistics & Data Analysis, 31(2):131-145.
Chen, S. X. (2000). “Probability density function estimation using gamma kernels,” Annals of the Institute of Statistical Mathematics, 52(3):471-480.
Diggle, P. (1985). “A kernel method for smoothing point process data,” Journal of the Royal Statistical Society. Series C (Applied Statistics), 34(2):138-147.
Igarashi, G. and Y. Kakizawa (2014). “Re-formulation of inverse Gaussian, reciprocal inverse Gaussian, and Birnbaum-Saunders kernel estimators,” Statistics & Probability Letters, 84:235-246.
Igarashi, G. and Y. Kakizawa (2015). “Bias corrections for some asymmetric kernel estimators,” Journal of Statistical Planning and Inference, 159:37-63.
Igarashi, G. (2016). “Bias reductions for beta kernel estimation,” Journal of Nonparametric Statistics, 28(1):1-30.
Scaillet, O. (2004). “Density estimation using inverse and reciprocal inverse Gaussian kernels,” Journal of Nonparametric Statistics, 16(1-2):217-226.
Scott, D. W. (1992). “Multivariate density estimation: Theory, practice, and visualization,” New York: Wiley.
Zhang, S. and R. J. Karunamuni (2010). “Boundary performance of the beta kernel estimators,” Journal of Nonparametric Statistics, 22(1):81-104.
See Also
The Ake, bde, and Conake packages and the function npuniden.reflect
.
Examples
## Not run:
## Example 1: f(0)=0, f(1)=1, plot boundary corrected density,
## unadjusted density, and DGP
set.seed(42)
n <- 100
X <- sort(rbeta(n,5,1))
dgp <- dbeta(X,5,1)
model.g1 <- npuniden.boundary(X,kertype="gaussian1")
model.g2 <- npuniden.boundary(X,kertype="gaussian2")
model.b1 <- npuniden.boundary(X,kertype="beta1")
model.b2 <- npuniden.boundary(X,kertype="beta2")
model.fb <- npuniden.boundary(X,kertype="fb")
model.unadjusted <- npuniden.boundary(X,a=-Inf,b=Inf)
ylim <- c(0,max(c(dgp,model.g1$f,model.g2$f,model.b1$f,model.b2$f,model.fb$f)))
plot(X,dgp,ylab="Density",ylim=ylim,type="l")
lines(X,model.g1$f,lty=2,col=2)
lines(X,model.g2$f,lty=3,col=3)
lines(X,model.b1$f,lty=4,col=4)
lines(X,model.b2$f,lty=5,col=5)
lines(X,model.fb$f,lty=6,col=6)
lines(X,model.unadjusted$f,lty=7,col=7)
rug(X)
legend("topleft",c("DGP",
"Boundary Kernel (gaussian1)",
"Boundary Kernel (gaussian2)",
"Boundary Kernel (beta1)",
"Boundary Kernel (beta2)",
"Boundary Kernel (floating boundary)",
"Unadjusted"),col=1:7,lty=1:7,bty="n")
## Example 2: f(0)=0, f(1)=0, plot density, distribution, DGP, and
## asymptotic point-wise confidence intervals
set.seed(42)
X <- sort(rbeta(100,5,3))
model <- npuniden.boundary(X)
par(mfrow=c(1,2))
ylim=range(c(model$f,model$f+1.96*model$sd.f,model$f-1.96*model$sd.f,dbeta(X,5,3)))
plot(X,model$f,ylim=ylim,ylab="Density",type="l",)
lines(X,model$f+1.96*model$sd.f,lty=2)
lines(X,model$f-1.96*model$sd.f,lty=2)
lines(X,dbeta(X,5,3),col=2)
rug(X)
legend("topleft",c("Density","DGP"),lty=c(1,1),col=1:2,bty="n")
plot(X,model$F,ylab="Distribution",type="l")
lines(X,model$F+1.96*model$sd.F,lty=2)
lines(X,model$F-1.96*model$sd.F,lty=2)
lines(X,pbeta(X,5,3),col=2)
rug(X)
legend("topleft",c("Distribution","DGP"),lty=c(1,1),col=1:2,bty="n")
## Example 3: Age for working age males in the cps71 data set bounded
## below by 21 and above by 65
data(cps71)
attach(cps71)
model <- npuniden.boundary(age,a=21,b=65)
par(mfrow=c(1,1))
hist(age,prob=TRUE,main="")
lines(age,model$f)
lines(density(age,bw=model$h),col=2)
legend("topright",c("Boundary Kernel","Unadjusted"),lty=c(1,1),col=1:2,bty="n")
detach(cps71)
## End(Not run)