kld_est {kldest}    R Documentation
Kullback-Leibler divergence estimator for discrete, continuous or mixed data.
Description
For two mixed continuous/discrete distributions with densities p and q, and denoting x = (x_c, x_d), the Kullback-Leibler divergence D_{KL}(p || q) is given as

$$D_{KL}(p \,\|\, q) = \sum_{x_d} \int p(x_c, x_d) \log\left(\frac{p(x_c, x_d)}{q(x_c, x_d)}\right) \mathrm{d}x_c.$$

Conditioning on the discrete variables x_d, this can be re-written as

$$D_{KL}(p \,\|\, q) = \sum_{x_d} p(x_d)\, D_{KL}\big(p(\cdot \mid x_d) \,\|\, q(\cdot \mid x_d)\big) + D_{KL}\big(p_{x_d} \,\|\, q_{x_d}\big),$$

where p_{x_d} and q_{x_d} denote the marginal distributions of the discrete variables under p and q. The conditional terms D_{KL}(p(. | x_d) || q(. | x_d)) are approximated via nearest neighbour- or kernel-based density estimates on the datasets X and Y stratified by the discrete variables, and the marginal term D_{KL}(p_{x_d} || q_{x_d}) is approximated using relative frequencies.
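To make the decomposition concrete, the following is a minimal sketch that reproduces the stratified estimate by hand, assuming (as documented under Arguments below) that the exported estimators kld_est_nn and kld_est_discrete accept univariate samples as plain vectors:

library(kldest)

set.seed(1)
X <- data.frame(cont  = rnorm(100),
                discr = sample(c("a","b"), 100, replace = TRUE, prob = c(0.3, 0.7)))
Y <- data.frame(cont  = rnorm(100, sd = 2),
                discr = sample(c("a","b"), 100, replace = TRUE))

# Marginal term D_KL(p_{x_d} || q_{x_d}), estimated via relative frequencies
kl_disc <- kld_est_discrete(X$discr, Y$discr)

# Conditional terms D_KL(p(.|x_d) || q(.|x_d)), estimated per stratum with
# the nearest neighbour method and weighted by the relative frequency p(x_d)
kl_cond <- sum(vapply(unique(X$discr), function(l) {
  mean(X$discr == l) * kld_est_nn(X$cont[X$discr == l], Y$cont[Y$discr == l])
}, numeric(1)))

kl_disc + kl_cond  # comparable to kld_est(X, Y)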
Usage
kld_est(
  X,
  Y = NULL,
  q = NULL,
  estimator.continuous = kld_est_nn,
  estimator.discrete = kld_est_discrete,
  vartype = NULL
)
Arguments
X, Y: n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution P and m samples from the approximate distribution Q in d dimensions. Y can be left blank if q is specified (see below).
q: The density function of the approximate distribution Q. Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be specified directly as the probability density/mass function (see the sketch after this argument list). However, for mixed continuous/discrete distributions, q must be given in decomposed form, q(y_c, y_d) = q_{c|d}(y_c | y_d) q_d(y_d), specified as a named list with field cond for the conditional density q_{c|d}(y_c | y_d) (a function that expects two arguments y_c and y_d) and field disc for the discrete marginal density q_d(y_d) (a function that expects one argument y_d). If such a decomposition is not available, it may be preferable to instead simulate a large sample from Q and use the two-sample syntax.
estimator.continuous, estimator.discrete: KL divergence estimators for continuous and discrete data, respectively. Both are functions with two arguments, X and Y or X and q, depending on whether a two-sample or one-sample problem is considered. Defaults are kld_est_nn and kld_est_discrete, respectively.
vartype: A length-d character vector, with vartype[i] = "c" meaning the i-th variable is continuous, and vartype[i] = "d" meaning it is discrete. If unspecified, vartype is "c" for numeric columns and "d" for character or factor columns. This default will mostly work, except when levels of discrete variables are encoded using numbers (e.g., 0 for females and 1 for males) or for count data; in those cases, specify vartype explicitly (see the last example below).
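For illustration, here is a minimal sketch of purely continuous problems, where q can be passed directly as a density function and the continuous estimator can be swapped out. It assumes kld_est_kde1 is the package's kernel density-based one-dimensional estimator; any two-sample estimator with the documented (X, Y) signature would work in its place.

# Purely continuous, one-sample: q passed directly as a density function
set.seed(0)
X <- rnorm(100)
kld_est(X, q = dnorm)  # true divergence is 0, so the estimate should be small

# Two-sample problem with a swapped-in continuous estimator
# (kld_est_kde1 is assumed to be the kernel density-based 1D estimator)
Y <- rnorm(100, mean = 1)
kld_est(X, Y, estimator.continuous = kld_est_kde1)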
Value
A scalar, the estimated Kullback-Leibler divergence $\hat{D}_{KL}(P \,\|\, Q)$.
Examples
# 2D example, two samples
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep('a',4), rep('b',6)))
Y <- data.frame(cont  = c(rnorm(5), rnorm(5, sd = 2)),
                discr = c(rep('a',5), rep('b',5)))
kld_est(X, Y)
# 2D example, one sample
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep(0,4), rep(1,6)))
q <- list(cond = function(xc, xd) dnorm(xc, mean = xd, sd = 1),
          disc = function(xd) dbinom(xd, size = 1, prob = 0.5))
kld_est(X, q = q, vartype = c("c","d"))
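The vartype caveat from the Arguments section can be made explicit with a two-sample variant of the example above; the data here are illustrative only:

# With discrete levels coded as numbers, the default would treat both
# columns as continuous, so vartype must be supplied explicitly
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep(0,4), rep(1,6)))
Y <- data.frame(cont  = c(rnorm(5), rnorm(5, sd = 2)),
                discr = rbinom(10, size = 1, prob = 0.5))
kld_est(X, Y, vartype = c("c","d"))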
[Package kldest version 1.0.0]