emp_kl_div {representr} | R Documentation |
Calculate the empirical KL divergence for a representative dataset as compared to the true dataset
Description
Calculate the empirical KL divergence for a representative dataset as compared to the true dataset
Usage
emp_kl_div(
true_dat,
rep_dat,
categoric_vars,
numeric_vars,
l_m = 10,
weights = rep(1, nrow(rep_dat))
)
Arguments
true_dat |
The true dataset |
rep_dat |
A representative dataset |
categoric_vars |
A vector of column positions or column names for the categoric variables. |
numeric_vars |
A vector of column positions or column names for the numeric variables. |
l_m |
Approximate number of true data points in each bin for numeric variables. Default is 10. |
weights |
If weighted frequencies are desired, pass a vector of weights with one entry per row of the representative dataset. The default gives every row a weight of 1. |
Details
This function estimates the KL divergence between two samples of data using the empirical distribution functions of the representative dataset and the true dataset. Continuous variables are first transformed to categorical ones using a histogram approach with statistically equivalent, data-dependent bins, as detailed in
Wang, Qing, Sanjeev R. Kulkarni, and Sergio Verdú. "Divergence estimation of continuous distributions based on data-dependent partitions." IEEE Transactions on Information Theory 51.9 (2005): 3064-3074.
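The approach described above can be illustrated with a minimal sketch. This is not the package's implementation: `kl_sketch` is a hypothetical helper written for illustration. It assumes that variables are given by column name, that numeric variables are binned at quantiles of the true data so that each bin holds roughly `l_m` true points, that `weights` apply to rows of the representative data, and that the divergence is taken from the representative distribution to the true one.

```r
# Hypothetical helper, NOT part of representr: a plug-in KL estimate after
# binning numeric variables at quantiles of the true data.
kl_sketch <- function(true_dat, rep_dat, categoric_vars, numeric_vars,
                      l_m = 10, weights = rep(1, nrow(rep_dat))) {
  # Start from the categorical columns (assumes column names, not positions)
  true_cells <- true_dat[, categoric_vars, drop = FALSE]
  rep_cells  <- rep_dat[, categoric_vars, drop = FALSE]

  # Roughly l_m true points per bin for each numeric variable
  n_bins <- max(1, floor(nrow(true_dat) / l_m))
  for (v in numeric_vars) {
    interior <- if (n_bins > 1) {
      unique(quantile(true_dat[[v]], seq_len(n_bins - 1) / n_bins,
                      names = FALSE))
    } else {
      numeric(0)
    }
    breaks <- c(-Inf, interior, Inf)  # the same bins are used for both datasets
    true_cells[[v]] <- cut(true_dat[[v]], breaks)
    rep_cells[[v]]  <- cut(rep_dat[[v]], breaks)
  }

  # One label per joint cell (categorical levels crossed with numeric bins)
  true_key <- do.call(paste, c(true_cells, sep = "|"))
  rep_key  <- do.call(paste, c(rep_cells, sep = "|"))
  cells <- union(true_key, rep_key)

  # Empirical cell probabilities: q for the true data, p (weighted) for rep_dat
  q <- as.numeric(table(factor(true_key, levels = cells))) / length(true_key)
  p <- as.numeric(tapply(weights, factor(rep_key, levels = cells), sum))
  p[is.na(p)] <- 0
  p <- p / sum(p)

  # Cells with p = 0 contribute nothing; a cell with p > 0 and q = 0 gives Inf
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}
```

Under these assumptions, passing the same data frame as both `true_dat` and `rep_dat` gives an estimate of 0, and the estimate grows as the representative records concentrate in cells that are rare in the true data.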
Examples
data("rl_reg1")
## random prototyping
rep_dat_random <- represent(rl_reg1, identity.rl_reg1, "proto_random", id = FALSE, parallel = FALSE)
## empirical KL divergence
cat_vars <- c("sex")
num_vars <- c("income", "bp")
emp_kl_div(rl_reg1[unique(identity.rl_reg1), c(cat_vars, num_vars)],
           rep_dat_random[, c(cat_vars, num_vars)],
           cat_vars, num_vars)