kld_ci_subsampling {kldest}R Documentation

Uncertainty of KL divergence estimate using Politis/Romano's subsampling bootstrap.

Description

This function computes a confidence interval for KL divergence based on the subsampling bootstrap introduced by Politis and Romano. See Details for theoretical properties of this method.

Usage

kld_ci_subsampling(
  X,
  Y = NULL,
  q = NULL,
  estimator = kld_est_nn,
  B = 500L,
  alpha = 0.05,
  subsample.size = function(x) x^(2/3),
  convergence.rate = sqrt,
  method = c("quantile", "se"),
  include.boot = FALSE,
  n.cores = 1L,
  ...
)

Arguments

X, Y

n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution P and m samples from the approximate distribution Q in d dimensions. Y can be left blank if q is specified (see below).

q

The density function of the approximate distribution Q. Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be directly specified as the probability density/mass function. However, for mixed continuous/discrete distributions, q must be given in decomposed form, q(y_c,y_d)=q_{c|d}(y_c|y_d)q_d(y_d), specified as a named list with field cond for the conditional density q_{c|d}(y_c|y_d) (a function that expects two arguments y_c and y_d) and disc for the discrete marginal density q_d(y_d) (a function that expects one argument y_d). If such a decomposition is not available, it may be preferable to instead simulate a large sample from Q and use the two-sample syntax.

estimator

The Kullback-Leibler divergence estimation method; a function expecting two inputs (X and Y or q, depending on arguments provided). Defaults to kld_est_nn.

B

Number of bootstrap replicates (default: 500), the larger, the more accurate, but also more computationally expensive.

alpha

Error level, defaults to 0.05.

subsample.size

A function specifying the size of the subsamples, defaults to f(x) = x^{2/3}.

convergence.rate

A function computing the convergence rate of the estimator as a function of sample sizes. Defaults to f(x) = x^{1/2}. If convergence.rate is NULL, it is estimated empirically from the sample(s) using kldest::convergence_rate().

method

Either "quantile" (the default), also known as the reverse percentile method, or "se" for a normal approximation of the KL divergence estimator using the standard error of the subsamples.

include.boot

Boolean, TRUE means KL divergence estimates on subsamples are included in the returned list. Defaults to FALSE.

n.cores

Number of cores to use in parallel computing (defaults to 1, which means that no parallel computing is used). To use this option, the parallel package must be installed and the OS must be of UNIX type (i.e., not Windows). Otherwise, n.cores will be reset to 1, with a message.

...

Arguments passed on to estimator, i.e. via the call estimator(X, Y = Y, ...) or estimator(X, q = q, ...).

Details

In general terms, tetting b_n be the subsample size for a sample of size n, and \tau_n the convergence rate of the estimator, a confidence interval calculated by subsampling has asymptotic coverage 1 - \alpha as long as b_n/n\rightarrow 0, b_n\rightarrow\infty and \frac{\tau_{b_n}}{\tau_n}\rightarrow 0.

In many cases, the convergence rate of the nearest-neighbour based KL divergence estimator is \tau_n = \sqrt{n} and the condition on the subsample size reduces to b_n/n\rightarrow 0 and b_n\rightarrow\infty. By default, b_n = n^{2/3}. In a two-sample problem, n and b_n are replaced by effective sample sizes n_\text{eff} = \min(n,m) and b_{n,\text{eff}} = \min(b_n,b_m).

Reference:

Politis and Romano, "Large sample confidence regions based on subsamples under minimal assumptions", The Annals of Statistics, Vol. 22, No. 4 (1994).

Value

A list with the following fields:

Examples

# 1D Gaussian (one- and two-sample problems)
set.seed(0)
X <- rnorm(100)
Y <- rnorm(100, mean = 1, sd = 2)
q <- function(x) dnorm(x, mean =1, sd = 2)
kld_gaussian(mu1 = 0, sigma1 = 1, mu2 = 1, sigma2 = 2^2)
kld_est_nn(X, Y = Y)
kld_est_nn(X, q = q)
kld_ci_subsampling(X, Y)$ci
kld_ci_subsampling(X, q = q)$ci


[Package kldest version 1.0.0 Index]