kld_ci_subsampling {kldest}    R Documentation

Uncertainty of KL divergence estimate using Politis/Romano's subsampling bootstrap.


This function computes a confidence interval for KL divergence based on the subsampling bootstrap introduced by Politis and Romano. See Details for theoretical properties of this method.


kld_ci_subsampling(
  X,
  Y = NULL,
  q = NULL,
  estimator = kld_est_nn,
  B = 500L,
  alpha = 0.05,
  subsample.size = function(x) x^(2/3),
  convergence.rate = sqrt,
  method = c("quantile", "se"),
  include.boot = FALSE,
  n.cores = 1L,
  ...
)


X, Y

n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution P and m samples from the approximate distribution Q in d dimensions. Y can be left blank if q is specified (see below).


q

The density function of the approximate distribution Q. Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be directly specified as the probability density/mass function. However, for mixed continuous/discrete distributions, q must be given in decomposed form, q(y_c,y_d) = q_{c|d}(y_c|y_d) q_d(y_d), specified as a named list with field cond for the conditional density q_{c|d}(y_c|y_d) (a function that expects two arguments y_c and y_d) and disc for the discrete marginal density q_d(y_d) (a function that expects one argument y_d). If such a decomposition is not available, it may be preferable to instead simulate a large sample from Q and use the two-sample syntax.


estimator

The Kullback-Leibler divergence estimation method; a function expecting two inputs (X and Y or q, depending on arguments provided). Defaults to kld_est_nn.


B

Number of bootstrap replicates (default: 500). Larger values give a more accurate interval but are more computationally expensive.


alpha

Error level; the confidence interval has nominal coverage 1 - alpha. Defaults to 0.05.


subsample.size

A function specifying the size of the subsamples, defaults to f(x) = x^{2/3}.


convergence.rate

A function computing the convergence rate of the estimator as a function of sample sizes. Defaults to f(x) = x^{1/2}. If convergence.rate is NULL, it is estimated empirically from the sample(s) using kldest::convergence_rate().


method

Either "quantile" (the default), also known as the reverse percentile method, or "se" for a normal approximation of the KL divergence estimator using the standard error of the subsamples.


include.boot

Boolean; if TRUE, the KL divergence estimates on the subsamples are included in the returned list. Defaults to FALSE.


n.cores

Number of cores to use in parallel computing (defaults to 1, which means that no parallel computing is used). To use this option, the parallel package must be installed and the OS must be of UNIX type (i.e., not Windows). Otherwise, n.cores will be reset to 1, with a message.


...

Arguments passed on to estimator, i.e. via the call estimator(X, Y = Y, ...) or estimator(X, q = q, ...).
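
As an illustration of the decomposed form of q described for mixed continuous/discrete distributions, the following base R sketch builds such a named list. The distribution itself is hypothetical and chosen only for the example: y_d is Bernoulli with success probability 0.3, and y_c given y_d is Gaussian with mean y_d and unit standard deviation.

```r
# Hypothetical mixed distribution (an assumption for illustration only):
#   y_d ~ Bernoulli(0.3),  y_c | y_d ~ N(mean = y_d, sd = 1)
q_mixed <- list(
  # conditional density q_{c|d}(y_c | y_d): a function of two arguments
  cond = function(y_c, y_d) dnorm(y_c, mean = y_d, sd = 1),
  # discrete marginal q_d(y_d): a function of one argument
  disc = function(y_d) dbinom(y_d, size = 1, prob = 0.3)
)

# Joint density q(y_c, y_d) = q_{c|d}(y_c | y_d) * q_d(y_d) at one point
joint <- q_mixed$cond(0.5, 1) * q_mixed$disc(1)
joint
```

A list of this shape would then be supplied as the q argument in place of a single density function.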


In general terms, letting b_n be the subsample size for a sample of size n, and \tau_n the convergence rate of the estimator, a confidence interval calculated by subsampling has asymptotic coverage 1 - \alpha as long as b_n/n \rightarrow 0, b_n \rightarrow \infty and \frac{\tau_{b_n}}{\tau_n} \rightarrow 0.
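
The general recipe can be sketched in base R for a one-sample, univariate statistic. This is a minimal illustration of a quantile-type subsampling interval, not the package's implementation: the helper name subsampling_ci is hypothetical, the sample mean stands in for a KL divergence estimator, and rounding the subsample size up with ceiling() is an assumption.

```r
# Minimal sketch of a subsampling confidence interval (hypothetical helper,
# not part of kldest). `estimator` maps a numeric vector to a single number.
subsampling_ci <- function(x, estimator, B = 500L, alpha = 0.05,
                           subsample.size = function(n) n^(2/3),
                           convergence.rate = sqrt) {
  n <- length(x)
  b <- ceiling(subsample.size(n))   # assumption: round b_n up to an integer
  est <- estimator(x)               # estimate on the full sample
  # estimates on B subsamples of size b, drawn without replacement
  sub <- vapply(seq_len(B), function(i) estimator(sample(x, b)), numeric(1))
  # rescaled deviations tau_b * (estimate on subsample - full estimate)
  z <- convergence.rate(b) * (sub - est)
  # upper quantile first, so that ci[1] is the lower bound
  qz <- quantile(z, probs = c(1 - alpha / 2, alpha / 2), names = FALSE)
  list(est = est, ci = est - qz / convergence.rate(n))
}

set.seed(1)
res <- subsampling_ci(rnorm(1000, mean = 2), estimator = mean)
res$ci
```

The interval is obtained by subtracting the rescaled quantiles from the full-sample estimate, which is why the upper quantile yields the lower bound.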

In many cases, the convergence rate of the nearest-neighbour based KL divergence estimator is \tau_n = \sqrt{n} and the condition on the subsample size reduces to b_n/n \rightarrow 0 and b_n \rightarrow \infty. By default, b_n = n^{2/3}. In a two-sample problem, n and b_n are replaced by effective sample sizes n_\text{eff} = \min(n,m) and b_{n,\text{eff}} = \min(b_n,b_m).
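
A short numerical check may help to see how the default choices behave. The base R sketch below evaluates the default subsample size and convergence rate for growing n, and computes the effective sizes for a two-sample problem; the sample sizes n = 100 and m = 400 are arbitrary values chosen for illustration.

```r
# Defaults taken from the usage section
subsample.size   <- function(x) x^(2/3)
convergence.rate <- sqrt

# b_n grows, while b_n / n and tau_{b_n} / tau_n shrink as n increases
n_grid <- 10^(2:6)
b_grid <- subsample.size(n_grid)
checks <- data.frame(
  n         = n_grid,
  b_n       = b_grid,
  b_over_n  = b_grid / n_grid,                                     # n^(-1/3)
  tau_ratio = convergence.rate(b_grid) / convergence.rate(n_grid)  # n^(-1/6)
)
checks

# Two-sample problem: effective sample size and effective subsample size
n <- 100; m <- 400   # arbitrary illustrative sample sizes
n_eff <- min(n, m)
b_eff <- min(subsample.size(n), subsample.size(m))
c(n_eff = n_eff, b_eff = b_eff)
```

With these defaults the ratio of convergence rates decays only like n^(-1/6), so the asymptotic regime is approached slowly.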


Politis and Romano, "Large sample confidence regions based on subsamples under minimal assumptions", The Annals of Statistics, Vol. 22, No. 4 (1994).


A list with the following fields:


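# 1D Gaussian (one- and two-sample problems)
X <- rnorm(100)
Y <- rnorm(100, mean = 1, sd = 2)
q <- function(x) dnorm(x, mean = 1, sd = 2)
# analytical KL divergence (sigma2 = 2^2 is the variance matching sd = 2)
kld_gaussian(mu1 = 0, sigma1 = 1, mu2 = 1, sigma2 = 2^2)
# point estimates: two-sample, then one-sample
kld_est_nn(X, Y = Y)
kld_est_nn(X, q = q)
# subsampling confidence intervals: two-sample, then one-sample
kld_ci_subsampling(X, Y)$ci
kld_ci_subsampling(X, q = q)$ci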

[Package kldest version 1.0.0 Index]