kld_est {kldest}        R Documentation

Kullback-Leibler divergence estimator for discrete, continuous or mixed data.

Description

For two mixed continuous/discrete distributions with densities p and q, and denoting x = (x_c, x_d), the Kullback-Leibler divergence D_{KL}(p||q) is given as

D_{KL}(p||q) = \sum_{x_d} \int p(x_c,x_d) \log\left(\frac{p(x_c,x_d)}{q(x_c,x_d)}\right) dx_c.

Conditioning on the discrete variables x_d, this can be re-written as

D_{KL}(p||q) = \sum_{x_d} p(x_d) D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big) + D_{KL}\big(p_{x_d}||q_{x_d}\big).

Here, the conditional terms

D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big)

are approximated via nearest neighbour- or kernel-based density estimates on the datasets X and Y stratified by the discrete variables, and the marginal term

D_{KL}\big(p_{x_d}||q_{x_d}\big)

is approximated using relative frequencies.
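
For intuition, the decomposition can be carried out by hand. The following is a minimal sketch (not the package's internal implementation), assuming a single discrete variable and assuming the default estimators kld_est_nn and kld_est_discrete accept univariate samples as plain vectors, as documented below for kld_est itself; names such as p_d and kl_cond are illustrative:

library(kldest)

set.seed(1)
X <- data.frame(cont  = rnorm(100),
                discr = sample(c("a", "b"), 100, replace = TRUE))
Y <- data.frame(cont  = c(rnorm(50), rnorm(50, sd = 2)),
                discr = sample(c("a", "b"), 100, replace = TRUE))

# Marginal weights p(x_d): relative frequencies of the discrete levels in X
p_d <- table(X$discr) / nrow(X)

# Conditional terms: continuous KL divergence within each stratum
kl_cond <- sapply(names(p_d), function(lev)
  kld_est_nn(X$cont[X$discr == lev], Y$cont[Y$discr == lev]))

# Marginal term: discrete KL divergence between the level frequencies
kl_disc <- kld_est_discrete(X$discr, Y$discr)

# The weighted sum reproduces the decomposition; kld_est(X, Y) automates these steps
sum(p_d * kl_cond) + kl_disc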

Usage

kld_est(
  X,
  Y = NULL,
  q = NULL,
  estimator.continuous = kld_est_nn,
  estimator.discrete = kld_est_discrete,
  vartype = NULL
)

Arguments

X, Y

n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution P and m samples from the approximate distribution Q in d dimensions. Y can be left blank if q is specified (see below).

q

The density function of the approximate distribution Q. Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be specified directly as the probability density/mass function. However, for mixed continuous/discrete distributions, q must be given in decomposed form, q(y_c,y_d) = q_{c|d}(y_c|y_d) q_d(y_d), specified as a named list with field cond for the conditional density q_{c|d}(y_c|y_d) (a function that expects two arguments y_c and y_d) and field disc for the discrete marginal density q_d(y_d) (a function that expects one argument y_d). If such a decomposition is not available, it may be preferable to instead simulate a large sample from Q and use the two-sample syntax.
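
As a sketch of that fallback (the distribution below is illustrative, not part of the package API), one can simulate from Q and pass the sample as Y:

set.seed(1)
X <- data.frame(cont  = rnorm(50),
                discr = rbinom(50, size = 1, prob = 0.5))
# Q decomposes into a conditional N(y_d, 1) and a Bernoulli(0.5) marginal;
# instead of passing q, draw a large sample from Q and use it as Y
m  <- 1e4
yd <- rbinom(m, size = 1, prob = 0.5)
yc <- rnorm(m, mean = yd, sd = 1)
Y  <- data.frame(cont = yc, discr = yd)
kld_est(X, Y, vartype = c("c", "d"))  # vartype needed: discr is numeric-coded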

estimator.continuous, estimator.discrete

KL divergence estimators for continuous and discrete data, respectively. Both are functions with two arguments X and Y or X and q, depending on whether a two-sample or one-sample problem is considered. Defaults are kld_est_nn and kld_est_discrete, respectively.
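
As an illustration of this interface, a user-defined estimator can be plugged in. The sketch below is a toy plug-in estimator for univariate continuous data built on base R's density(); my_kde_est is a hypothetical name, not a package function:

# Toy two-sample estimator: evaluate Gaussian KDEs of P and Q at the
# sample points of X and average log(p/q); univariate data only
my_kde_est <- function(X, Y) {
  x  <- as.numeric(unlist(X))
  y  <- as.numeric(unlist(Y))
  dP <- density(x)
  dQ <- density(y)
  pX <- approx(dP$x, dP$y, xout = x, rule = 2)$y
  qX <- approx(dQ$x, dQ$y, xout = x, rule = 2)$y
  mean(log(pX / qX))
}
kld_est(rnorm(100), rnorm(100, sd = 2), estimator.continuous = my_kde_est)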

vartype

A character vector of length d, with vartype[i] = "c" meaning the i-th variable is continuous and vartype[i] = "d" meaning it is discrete. If unspecified, vartype defaults to "c" for numeric columns and "d" for character or factor columns. This default mostly works, but fails if levels of discrete variables are encoded as numbers (e.g., 0 for females and 1 for males) or for count data, as illustrated below.
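
For instance (an illustrative snippet), a 0/1-coded discrete column would be inferred as continuous, so vartype must be set explicitly:

set.seed(0)
X <- data.frame(cont = rnorm(20), discr = rbinom(20, size = 1, prob = 0.5))
Y <- data.frame(cont = rnorm(20), discr = rbinom(20, size = 1, prob = 0.5))
# Default inference would yield vartype = c("c", "c"); override it so
# that discr is treated as discrete
kld_est(X, Y, vartype = c("c", "d"))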

Value

A scalar, the estimated Kullback-Leibler divergence \hat{D}_{KL}(P||Q).

Examples

# 2D example, two samples
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep('a', 4), rep('b', 6)))
Y <- data.frame(cont  = c(rnorm(5), rnorm(5, sd = 2)),
                discr = c(rep('a', 5), rep('b', 5)))
kld_est(X, Y)   # vartype inferred: 'cont' is numeric, 'discr' is character

# 2D example, one sample
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep(0, 4), rep(1, 6)))
q <- list(cond = function(xc, xd) dnorm(xc, mean = xd, sd = 1),
          disc = function(xd) dbinom(xd, size = 1, prob = 0.5))
kld_est(X, q = q, vartype = c("c", "d"))   # vartype required: 'discr' is numeric-coded
