summ_distance {pdqr} | R Documentation |
Summarize pair of distributions with distance
Description
This function computes distance between two distributions represented by pdqr-functions. Here "distance" is used in a broad sense: a single non-negative number representing how much two distributions differ from one another. Bigger values indicate bigger difference. Zero value means that input distributions are equivalent based on the method used (except method "avgdist", which almost always returns a positive value). The notion of "distance" is useful for doing statistical inference about similarity of two groups of numbers.
Usage
summ_distance(f, g, method = "KS")
Arguments
f | A pdqr-function of any type and class. |
g | A pdqr-function of any type and class. |
method | Method for computing distance. Should be one of "KS", "totvar", "compare", "wass", "cramer", "align", "avgdist", "entropy". |
Details
Methods can be separated into three categories: probability based, metric based, and entropy based.
Probability based methods return a number between 0 and 1 which is computed in a way that is mostly based on probability:
- Method "KS" (short for Kolmogorov-Smirnov) computes the supremum of absolute difference between p-functions corresponding to f and g (|F - G|). Here "supremum" is meant to describe the fact that if input functions have different types, there can be no point at which "KS" distance is achieved. Instead, there might be a sequence of points from left to right with |F - G| values tending to the result (see Examples).
- Method "totvar" (short for "total variation") computes the biggest absolute difference of probabilities over all subsets of the real line. In other words, there is a set of points (for "discrete" type) or intervals (for "continuous" type) whose total probability under f and g differs the most. Note that if f and g have different types, output is always 1. In that case the set of interest consists of all "x" values of the "discrete" pdqr-function: its probability under the "discrete" distribution is 1 and under the "continuous" one is 0.
- Method "compare" represents a value computed based on probabilities of one distribution being bigger than the other (see pdqr methods for "Ops" group generic family for more details on comparing pdqr-functions). It is computed as 2*(max(P(F > G), P(F < G)) + 0.5*P(F = G)) - 1 (here P(F > G) is basically summ_prob_true(f > g)). This is the maximum of two values (P(F > G) + 0.5*P(F = G) and P(F < G) + 0.5*P(F = G)), normalized to return values from 0 to 1. Another way to look at this measure is that it computes (before normalization) two ROC AUC values with method "expected" for the two possible orderings (f, g and g, f) and takes their maximum.
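As a rough illustration of the three probability based methods, the following base R sketch computes them by hand for two made-up discrete distributions. It does not use 'pdqr' and is not its implementation: the names x_f, p_f, x_g, p_g and the probabilities are assumptions chosen for the example, and values drawn from the two distributions are treated as independent.

```r
# Two hypothetical discrete distributions given as value/probability tables
x_f <- c(1, 2, 3); p_f <- c(0.2, 0.5, 0.3)
x_g <- c(2, 3, 4); p_g <- c(0.3, 0.3, 0.4)

# Put both probability vectors on a common grid of values
grid <- sort(union(x_f, x_g))
pf <- numeric(length(grid)); pf[match(x_f, grid)] <- p_f
pg <- numeric(length(grid)); pg[match(x_g, grid)] <- p_g

# "KS": largest absolute difference between cumulative probabilities
ks <- max(abs(cumsum(pf) - cumsum(pg)))

# "totvar": probability difference on the set of points where f has more
# mass than g (this set maximizes the difference)
totvar <- sum(pmax(pf - pg, 0))

# "compare": probabilities of one independent draw exceeding the other
joint <- outer(p_f, p_g)
p_gt <- sum(joint[outer(x_f, x_g, ">")])
p_lt <- sum(joint[outer(x_f, x_g, "<")])
p_eq <- sum(joint[outer(x_f, x_g, "==")])
# Maximum of the two tie-adjusted probabilities, rescaled from [0.5, 1] to [0, 1]
compare <- 2 * max(p_gt + 0.5 * p_eq, p_lt + 0.5 * p_eq) - 1

c(KS = ks, totvar = totvar, compare = compare)
```

If 'pdqr' is installed, these numbers can be checked against summ_distance() applied to new_d(data.frame(x = x_f, prob = p_f), "discrete") and its counterpart for g.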
Metric based methods compute "how far" two distributions are apart on the real line:
- Method "wass" (short for "Wasserstein") computes a 1-Wasserstein distance: "minimum cost of 'moving' one density into another", or "average path a density point should travel while transforming one into another". It is computed as integral of |F - G| (absolute difference between p-functions). If either f or g has "continuous" type, stats::integrate() is used, so relatively small numerical errors can happen.
- Method "cramer" computes Cramer distance: integral of (F - G)^2. This somewhat relates to "wass" method as variance relates to first central absolute moment. Relatively small numerical errors can happen.
- Method "align" computes an absolute value of shift d (possibly negative) that should be added to f to achieve both P(f+d >= g) >= 0.5 and P(f+d <= g) >= 0.5 (in other words, align f+d and g) as close as reasonably possible. Solution is found numerically with stats::uniroot(), so relatively small numerical errors can happen. Also note that this method is somewhat slow (compared to all others). To increase speed, use fewer elements in "x_tbl" metadata, for example with form_retype() or a smaller n_grid argument in as_*() functions.
- Method "avgdist" computes average distance between sample values from inputs. Basically, it is a deterministically computed approximation of expected value of absolute difference between random variables, or in 'pdqr' code: summ_mean(abs(f - g)) (but computed without randomness). Computation is done by approximating possibly present continuous pdqr-functions with discrete ones (see description of "pdqr.approx_discrete_n_grid" option for more information) and then computing output value directly based on two discrete pdqr-functions. Note that this method almost never returns zero, even for identical inputs (except the case of discrete pdqr-functions with identical one value).
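The "wass", "cramer", and "avgdist" definitions above can be sketched by hand for discrete inputs, where the p-functions are step functions and the integrals reduce to sums. This is a base R illustration under assumed inputs (the tables x_f, p_f, x_g, p_g and the helper avgdist() are hypothetical), not the code 'pdqr' runs.

```r
# Two hypothetical discrete distributions given as value/probability tables
x_f <- c(1, 2, 3); p_f <- c(0.2, 0.5, 0.3)
x_g <- c(2, 3, 4); p_g <- c(0.3, 0.3, 0.4)

# Step CDFs evaluated on the common grid; each value holds until the next point
grid <- sort(union(x_f, x_g))
cdf_f <- vapply(grid, function(x) sum(p_f[x_f <= x]), numeric(1))
cdf_g <- vapply(grid, function(x) sum(p_g[x_g <= x]), numeric(1))

# F - G is constant on each interval between neighboring grid points and is
# zero outside of the grid, so integrals become sums of value times width
widths <- diff(grid)
cdf_diff <- (cdf_f - cdf_g)[-length(grid)]

wass <- sum(abs(cdf_diff) * widths)   # integral of |F - G|
cramer <- sum(cdf_diff^2 * widths)    # integral of (F - G)^2

# "avgdist": expected absolute difference of two independent draws
avgdist <- function(x1, p1, x2, p2) {
  sum(outer(p1, p2) * abs(outer(x1, x2, "-")))
}
avg_fg <- avgdist(x_f, p_f, x_g, p_g)
# Identical inputs still give a positive value
avg_ff <- avgdist(x_f, p_f, x_f, p_f)

c(wass = wass, cramer = cramer, avgdist_fg = avg_fg, avgdist_ff = avg_ff)
```

The last value shows why "avgdist" is positive even for identical inputs: two independent draws from the same distribution usually differ.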
Entropy based methods compute output based on entropy characteristics:
- Method "entropy" computes sum of two Kullback-Leibler divergences: KL(f, g) + KL(g, f), which are outputs of summ_entropy2() with method "relative". Notes:
    - If f and g don't have the same support, distance can be very high.
    - Error is thrown if f and g have different types (the same as in summ_entropy2()).
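For intuition about the "entropy" method, here is a minimal base R sketch of the symmetrized Kullback-Leibler sum for two discrete distributions on the same support. The probability vectors p and q and the helper kl() are hypothetical, and the use of natural logarithms is an assumption; summ_entropy2() may treat details (such as zero probabilities) differently.

```r
# Hypothetical probabilities of two discrete distributions on the same values
p <- c(0.2, 0.5, 0.3)
q <- c(0.3, 0.3, 0.4)

# Kullback-Leibler divergence of q from p (assumes all probabilities positive)
kl <- function(p, q) sum(p * log(p / q))

# Symmetric sum of the two divergences, as in KL(f, g) + KL(g, f)
entropy_dist <- kl(p, q) + kl(q, p)
entropy_dist
```

Each divergence is not symmetric on its own, but their sum is, and it is zero when both inputs are the same.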
Value
A single non-negative number representing distance between pair of distributions. For methods "KS", "totvar", and "compare" it is not bigger than 1. For method "avgdist" it is almost always bigger than 0.
See Also
summ_separation() for computation of optimal threshold separating pair of distributions.
Other summary functions: summ_center(), summ_classmetric(), summ_entropy(), summ_hdr(), summ_interval(), summ_moment(), summ_order(), summ_prob_true(), summ_pval(), summ_quantile(), summ_roc(), summ_separation(), summ_spread()
Examples
d_unif <- as_d(dunif, max = 2)
d_norm <- as_d(dnorm, mean = 1)
vapply(
  c(
    "KS", "totvar", "compare",
    "wass", "cramer", "align", "avgdist",
    "entropy"
  ),
  function(meth) {
    summ_distance(d_unif, d_norm, method = meth)
  },
  numeric(1)
)
# "Supremum" quality of "KS" distance
d_dis <- new_d(2, "discrete")
## Distance is 1, which is a limit of |F - G| at points which tend to 2 from
## left
summ_distance(d_dis, d_unif, method = "KS")