R: Uncertainty of KL divergence estimate using Politis/Romano's...

kld_ci_subsampling {kldest}

R Documentation

Uncertainty of KL divergence estimate using Politis/Romano's subsampling bootstrap.

Description

This function computes a confidence interval for KL divergence based on the subsampling bootstrap introduced by Politis and Romano. See Details for theoretical properties of this method.

Usage

kld_ci_subsampling(
  X,
  Y = NULL,
  q = NULL,
  estimator = kld_est_nn,
  B = 500L,
  alpha = 0.05,
  subsample.size = function(x) x^(2/3),
  convergence.rate = sqrt,
  method = c("quantile", "se"),
  include.boot = FALSE,
  n.cores = 1L,
  ...
)

Arguments

`X`, `Y`	`n`-by-`d` and `m`-by-`d` data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. `d = 1`), representing `n` samples from the true distribution `P` and `m` samples from the approximate distribution `Q` in `d` dimensions. `Y` can be left blank if `q` is specified (see below).
`q`	The density function of the approximate distribution `Q`. Either `Y` or `q` must be specified. If the distributions are all continuous or all discrete, `q` can be directly specified as the probability density/mass function. However, for mixed continuous/discrete distributions, `q` must be given in decomposed form, `q(y_c,y_d)=q_{c\|d}(y_c\|y_d)q_d(y_d)`, specified as a named list with field `cond` for the conditional density `q_{c\|d}(y_c\|y_d)` (a function that expects two arguments `y_c` and `y_d`) and `disc` for the discrete marginal density `q_d(y_d)` (a function that expects one argument `y_d`). If such a decomposition is not available, it may be preferable to instead simulate a large sample from `Q` and use the two-sample syntax.
`estimator`	The Kullback-Leibler divergence estimation method; a function expecting two inputs (`X` and `Y` or `q`, depending on arguments provided). Defaults to `kld_est_nn`.
`B`	Number of bootstrap replicates (default: `500`), the larger, the more accurate, but also more computationally expensive.
`alpha`	Error level, defaults to `0.05`.
`subsample.size`	A function specifying the size of the subsamples, defaults to `f(x) = x^{2/3}`.
`convergence.rate`	A function computing the convergence rate of the estimator as a function of sample sizes. Defaults to `f(x) = x^{1/2}`. If `convergence.rate` is `NULL`, it is estimated empirically from the sample(s) using `kldest::convergence_rate()`.
`method`	Either `"quantile"` (the default), also known as the reverse percentile method, or `"se"` for a normal approximation of the KL divergence estimator using the standard error of the subsamples.
`include.boot`	Boolean, `TRUE` means KL divergence estimates on subsamples are included in the returned list. Defaults to `FALSE`.
`n.cores`	Number of cores to use in parallel computing (defaults to `1`, which means that no parallel computing is used). To use this option, the `parallel` package must be installed and the OS must be of UNIX type (i.e., not Windows). Otherwise, `n.cores` will be reset to `1`, with a message.
`...`	Arguments passed on to `estimator`, i.e. via the call `estimator(X, Y = Y, ...)` or `estimator(X, q = q, ...)`.

Details

In general terms, tetting b_n be the subsample size for a sample of size n, and \tau_n the convergence rate of the estimator, a confidence interval calculated by subsampling has asymptotic coverage 1 - \alpha as long as b_n/n\rightarrow 0, b_n\rightarrow\infty and \frac{\tau_{b_n}}{\tau_n}\rightarrow 0.

In many cases, the convergence rate of the nearest-neighbour based KL divergence estimator is \tau_n = \sqrt{n} and the condition on the subsample size reduces to b_n/n\rightarrow 0 and b_n\rightarrow\infty. By default, b_n = n^{2/3}. In a two-sample problem, n and b_n are replaced by effective sample sizes n_\text{eff} = \min(n,m) and b_{n,\text{eff}} = \min(b_n,b_m).

Reference:

Politis and Romano, "Large sample confidence regions based on subsamples under minimal assumptions", The Annals of Statistics, Vol. 22, No. 4 (1994).

Value

A list with the following fields:

"est" (the estimated KL divergence),
"ci" (a length 2 vector containing the lower and upper limits of the estimated confidence interval).
"boot" (a length B numeric vector with KL divergence estimates on the bootstrap subsamples), only included if include.boot = TRUE,

Examples

# 1D Gaussian (one- and two-sample problems)
set.seed(0)
X <- rnorm(100)
Y <- rnorm(100, mean = 1, sd = 2)
q <- function(x) dnorm(x, mean =1, sd = 2)
kld_gaussian(mu1 = 0, sigma1 = 1, mu2 = 1, sigma2 = 2^2)
kld_est_nn(X, Y = Y)
kld_est_nn(X, q = q)
kld_ci_subsampling(X, Y)$ci
kld_ci_subsampling(X, q = q)$ci

[Package kldest version 1.0.0 Index]