kld_est {kldest}        R Documentation

Kullback-Leibler divergence estimator for discrete, continuous or mixed data.

Description

For two mixed continuous/discrete distributions with densities p and q, and denoting x = (x_c, x_d), the Kullback-Leibler divergence D_{KL}(p||q) is given as

D_{KL}(p||q) = \sum_{x_d} \int p(x_c,x_d) \log\left(\frac{p(x_c,x_d)}{q(x_c,x_d)}\right)dx_c.

Conditioning on the discrete variables x_d, this can be rewritten as

D_{KL}(p||q) = \sum_{x_d} p_d(x_d) D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big) + D_{KL}\big(p_d||q_d\big).

Here, the terms

D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big)

are approximated via nearest neighbour- or kernel-based density estimates on the datasets X and Y stratified by the discrete variables, and

D_{KL}\big(p_d||q_d\big)

is approximated using relative frequencies.
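
To make the decomposition concrete, the following rough sketch (not the package's internal code) reproduces both terms by hand for one continuous and one discrete variable, using the estimators mentioned above:

set.seed(1)
X <- data.frame(xc = rnorm(100),         xd = sample(c("a","b"), 100, replace = TRUE))
Y <- data.frame(xc = rnorm(100, sd = 2), xd = sample(c("a","b"), 100, replace = TRUE))

# discrete marginal term D_KL(p_d || q_d), based on relative frequencies
kl_disc <- kld_est_discrete(X$xd, Y$xd)

# conditional terms D_KL(p(.|x_d) || q(.|x_d)), weighted by p_d(x_d)
lv  <- unique(X$xd)
p_d <- table(X$xd)[lv] / nrow(X)
kl_cond <- sum(sapply(lv, function(l)
  p_d[[l]] * kld_est_nn(X$xc[X$xd == l], Y$xc[Y$xd == l])))

kl_cond + kl_disc   # compare with kld_est(X, Y)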

Usage

kld_est(
  X,
  Y = NULL,
  q = NULL,
  estimator.continuous = kld_est_nn,
  estimator.discrete = kld_est_discrete,
  vartype = NULL
)

Arguments

X, Y

n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution P and m samples from the approximate distribution Q in d dimensions. Y can be left blank if q is specified (see below).

q

The density function of the approximate distribution Q. Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be specified directly as the probability density/mass function. For mixed continuous/discrete distributions, however, q must be given in decomposed form, q(y_c,y_d) = q_{c|d}(y_c|y_d) q_d(y_d), specified as a named list with fields cond for the conditional density q_{c|d}(y_c|y_d) (a function expecting two arguments y_c and y_d) and disc for the discrete marginal density q_d(y_d) (a function expecting one argument y_d). If such a decomposition is not available, it may be preferable to simulate a large sample from Q and use the two-sample syntax instead (see the first sketch after this argument list).

estimator.continuous, estimator.discrete

KL divergence estimators for continuous and discrete data, respectively. Each is a function of two arguments, X and Y (two-sample problem) or X and q (one-sample problem). Defaults are kld_est_nn and kld_est_discrete, respectively; any custom estimator with the same interface can be plugged in (see the second sketch after this argument list).

vartype

A length-d character vector, with vartype[i] = "c" meaning the i-th variable is continuous and vartype[i] = "d" meaning it is discrete. If unspecified, vartype is "c" for numeric columns and "d" for character or factor columns. This default works in most cases, but fails if levels of discrete variables are numerically encoded (e.g., 0 for females and 1 for males) or for count data, since such columns are numeric (see the third sketch below).
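
First sketch: if Q can only be sampled, the decomposed-density syntax can be avoided by simulating a large sample and using the two-sample syntax. The sample size m = 1e4 below is an arbitrary choice, and the target Q matches the one-sample example further down:

set.seed(0)
X  <- data.frame(cont = rnorm(10), discr = c(rep(0,4), rep(1,6)))
m  <- 1e4
yd <- rbinom(m, size = 1, prob = 0.5)
Y  <- data.frame(cont = rnorm(m, mean = yd), discr = yd)
kld_est(X, Y, vartype = c("c","d"))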
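
Second sketch: a custom continuous estimator with the two-argument interface can be passed via estimator.continuous. The wrapper below assumes kld_est_nn accepts a number-of-neighbours argument k (see its help page):

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
Y <- matrix(rnorm(200, sd = 2), ncol = 2)
nn5 <- function(X, Y) kld_est_nn(X, Y, k = 5)
kld_est(X, Y, estimator.continuous = nn5)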
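
Third sketch: a 0/1-coded grouping variable is numeric, so the default would treat it as continuous; passing vartype makes the intent explicit:

set.seed(1)
X <- data.frame(cont = rnorm(20),         discr = rbinom(20, 1, 0.5))
Y <- data.frame(cont = rnorm(20, sd = 2), discr = rbinom(20, 1, 0.5))
kld_est(X, Y, vartype = c("c","d"))   # without vartype, discr would be treated as "c"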

Value

A scalar, the estimated Kullback-Leibler divergence \hat D_{KL}(P||Q).

Examples

# 2D example, two samples: Q differs from P in the conditional spread
# for level 'b' and in the discrete frequencies
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep('a',4),rep('b',6)))
Y <- data.frame(cont  = c(rnorm(5), rnorm(5, sd = 2)),
                discr = c(rep('a',5),rep('b',5)))
kld_est(X, Y)

# 2D example, one sample
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep(0,4),rep(1,6)))
# decomposed density of Q: conditional (cond) and discrete marginal (disc)
q <- list(cond = function(xc,xd) dnorm(xc, mean = xd, sd = 1),
          disc = function(xd) dbinom(xd, size = 1, prob = 0.5))
# vartype is required here since the discrete variable is numerically encoded
kld_est(X, q = q, vartype = c("c","d"))
