kld_est {kldest}        R Documentation

Kullback-Leibler divergence estimator for discrete, continuous or mixed data.

Description

For two mixed continuous/discrete distributions with densities p and q, and denoting x = (x_c, x_d), the Kullback-Leibler divergence D_{KL}(p||q) is given as

D_{KL}(p||q) = \sum_{x_d} \int p(x_c,x_d) \log\left(\frac{p(x_c,x_d)}{q(x_c,x_d)}\right) dx_c.

Conditioning on the discrete variables x_d, this can be re-written as

D_{KL}(p||q) = \sum_{x_d} p(x_d) D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big) + D_{KL}\big(p_{x_d}||q_{x_d}\big).

Here, the conditional terms

D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big)

are approximated via nearest neighbour- or kernel-based density estimates on the datasets X and Y stratified by the discrete variables, and the marginal term

D_{KL}\big(p_{x_d}||q_{x_d}\big)

is approximated using relative frequencies.
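
For intuition, the decomposition can be carried out by hand. The following is a minimal sketch (not the package's internal implementation), assuming a single discrete variable and assuming the default estimators kld_est_nn and kld_est_discrete accept univariate samples as plain vectors, as documented below for kld_est itself; names such as p_d and kl_cond are illustrative:

library(kldest)

set.seed(1)
X <- data.frame(cont  = rnorm(100),
                discr = sample(c("a", "b"), 100, replace = TRUE))
Y <- data.frame(cont  = c(rnorm(50), rnorm(50, sd = 2)),
                discr = sample(c("a", "b"), 100, replace = TRUE))

# Marginal weights p(x_d): relative frequencies of the discrete levels in X
p_d <- table(X$discr) / nrow(X)

# Conditional terms: continuous KL divergence within each stratum
kl_cond <- sapply(names(p_d), function(lev)
  kld_est_nn(X$cont[X$discr == lev], Y$cont[Y$discr == lev]))

# Marginal term: discrete KL divergence between the level frequencies
kl_disc <- kld_est_discrete(X$discr, Y$discr)

# The weighted sum reproduces the decomposition; kld_est(X, Y) automates these steps
sum(p_d * kl_cond) + kl_disc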

Usage

kld_est(
  X,
  Y = NULL,
  q = NULL,
  estimator.continuous = kld_est_nn,
  estimator.discrete = kld_est_discrete,
  vartype = NULL
)

Arguments

X, Y

n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution P and m samples from the approximate distribution Q in d dimensions. Y can be left blank if q is specified (see below).

q

The density function of the approximate distribution Q. Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be specified directly as the probability density/mass function. However, for mixed continuous/discrete distributions, q must be given in decomposed form, q(y_c,y_d) = q_{c|d}(y_c|y_d) q_d(y_d), specified as a named list with field cond for the conditional density q_{c|d}(y_c|y_d) (a function that expects two arguments y_c and y_d) and field disc for the discrete marginal density q_d(y_d) (a function that expects one argument y_d). If such a decomposition is not available, it may be preferable to instead simulate a large sample from Q and use the two-sample syntax.
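
As a sketch of that fallback (the distribution below is illustrative, not part of the package API), one can simulate from Q and pass the sample as Y:

set.seed(1)
X <- data.frame(cont  = rnorm(50),
                discr = rbinom(50, size = 1, prob = 0.5))
# Q decomposes into a conditional N(y_d, 1) and a Bernoulli(0.5) marginal;
# instead of passing q, draw a large sample from Q and use it as Y
m  <- 1e4
yd <- rbinom(m, size = 1, prob = 0.5)
yc <- rnorm(m, mean = yd, sd = 1)
Y  <- data.frame(cont = yc, discr = yd)
kld_est(X, Y, vartype = c("c", "d"))  # vartype needed: discr is numeric-coded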

estimator.continuous, estimator.discrete

KL divergence estimators for continuous and discrete data, respectively. Both are functions with two arguments X and Y or X and q, depending on whether a two-sample or one-sample problem is considered. Defaults are kld_est_nn and kld_est_discrete, respectively.
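
As an illustration of this interface, a user-defined estimator can be plugged in. The sketch below is a toy plug-in estimator for univariate continuous data built on base R's density(); my_kde_est is a hypothetical name, not a package function:

# Toy two-sample estimator: evaluate Gaussian KDEs of P and Q at the
# sample points of X and average log(p/q); univariate data only
my_kde_est <- function(X, Y) {
  x  <- as.numeric(unlist(X))
  y  <- as.numeric(unlist(Y))
  dP <- density(x)
  dQ <- density(y)
  pX <- approx(dP$x, dP$y, xout = x, rule = 2)$y
  qX <- approx(dQ$x, dQ$y, xout = x, rule = 2)$y
  mean(log(pX / qX))
}
kld_est(rnorm(100), rnorm(100, sd = 2), estimator.continuous = my_kde_est)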

vartype

A character vector of length d, with vartype[i] = "c" meaning the i-th variable is continuous and vartype[i] = "d" meaning it is discrete. If unspecified, vartype defaults to "c" for numeric columns and "d" for character or factor columns. This default mostly works, but fails if levels of discrete variables are encoded as numbers (e.g., 0 for females and 1 for males) or for count data, as illustrated below.
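
For instance (an illustrative snippet), a 0/1-coded discrete column would be inferred as continuous, so vartype must be set explicitly:

set.seed(0)
X <- data.frame(cont = rnorm(20), discr = rbinom(20, size = 1, prob = 0.5))
Y <- data.frame(cont = rnorm(20), discr = rbinom(20, size = 1, prob = 0.5))
# Default inference would yield vartype = c("c", "c"); override it so
# that discr is treated as discrete
kld_est(X, Y, vartype = c("c", "d"))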

Value

A scalar, the estimated Kullback-Leibler divergence \hat{D}_{KL}(P||Q).

Examples

# 2D example, two samples
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep('a', 4), rep('b', 6)))
Y <- data.frame(cont  = c(rnorm(5), rnorm(5, sd = 2)),
                discr = c(rep('a', 5), rep('b', 5)))
kld_est(X, Y)   # vartype inferred: 'cont' is numeric, 'discr' is character

# 2D example, one sample
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep(0, 4), rep(1, 6)))
q <- list(cond = function(xc, xd) dnorm(xc, mean = xd, sd = 1),
          disc = function(xd) dbinom(xd, size = 1, prob = 0.5))
kld_est(X, q = q, vartype = c("c", "d"))   # vartype required: 'discr' is numeric-coded
