distance {philentropy}    R Documentation
Distances and Similarities between Probability Density Functions
Description
This function computes the distance/dissimilarity between two probability density functions.
Usage
distance(
x,
method = "euclidean",
p = NULL,
test.na = TRUE,
unit = "log",
epsilon = 1e-05,
est.prob = NULL,
use.row.names = FALSE,
as.dist.obj = FALSE,
diag = FALSE,
upper = FALSE,
mute.message = FALSE
)
Arguments
x
a numeric data.frame or matrix storing probability vectors, or a numeric data.frame or matrix storing counts (if est.prob is specified).
method
a character string indicating the distance measure that should be computed.
p
power of the Minkowski distance.
test.na
a boolean value indicating whether input vectors should be tested for NA values. Computations are faster if test.na = FALSE.
unit
a character string specifying the logarithm unit that should be used to compute distances that depend on log computations.
epsilon
a small value to address cases in the distance computation where division by zero occurs. In these cases, x / 0 or 0 / 0 will be replaced by epsilon. Default: epsilon = 1e-05.
est.prob
method to estimate probabilities from input count vectors such as non-probability vectors. Default: est.prob = NULL. The available option is est.prob = "empirical" (see Details).
use.row.names
a logical value indicating whether or not row names from the input matrix shall be used as rownames and colnames of the output distance matrix. Default value is use.row.names = FALSE.
as.dist.obj
shall the return value or matrix be an object of class dist? Default is as.dist.obj = FALSE.
diag
if as.dist.obj = TRUE, this value indicates whether the diagonal of the distance matrix should be printed (analogous to the diag argument of as.dist).
upper
if as.dist.obj = TRUE, this value indicates whether the upper triangle of the distance matrix should be printed (analogous to the upper argument of as.dist).
mute.message
a logical value indicating whether or not messages printed by this function shall be muted. Default is mute.message = FALSE.
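As a quick orientation, the following minimal sketch (not part of the original documentation) combines several of these arguments in one call; it reuses the probability matrix from the Examples section and assumes philentropy is installed:

library(philentropy)

# three probability vectors stored as matrix rows (as in the Examples below)
ProbMatrix <- rbind(1:10 / sum(1:10),
                    20:29 / sum(20:29),
                    30:39 / sum(30:39))
rownames(ProbMatrix) <- paste0("Example", 1:3)

# pairwise Euclidean distances, keeping the row names and returning
# an object of class "dist" instead of a plain matrix
distance(ProbMatrix,
         method        = "euclidean",
         use.row.names = TRUE,
         as.dist.obj   = TRUE,
         mute.message  = TRUE)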
Details
Here a distance is defined as a quantitative degree of how far two mathematical objects are apart from each other (Cha, 2007).
This function implements the following distance/similarity measures to quantify the distance between probability density functions:
L_p Minkowski family
Euclidean :
d = sqrt( \sum | P_i - Q_i |^2)
Manhattan :
d = \sum | P_i - Q_i |
Minkowski :
d = ( \sum | P_i - Q_i |^p )^(1/p)
Chebyshev :
d = max | P_i - Q_i |
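For illustration (not part of the original reference), the Euclidean and Manhattan formulas above can be checked directly in base R against distance(); the vectors match the Examples section and the method labels "euclidean" and "manhattan" are assumed to be among those returned by getDistMethods():

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
sqrt(sum((P - Q)^2))      # Euclidean
sum(abs(P - Q))           # Manhattan
philentropy::distance(rbind(P, Q), method = "euclidean")
philentropy::distance(rbind(P, Q), method = "manhattan")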
L_1 family
Sorensen :
d = \sum | P_i - Q_i | / \sum (P_i + Q_i)
Gower :
d = 1/N * \sum | P_i - Q_i | , where N is the number of vector elements
Soergel :
d = \sum | P_i - Q_i | / \sum max(P_i , Q_i)
Kulczynski d :
d = \sum | P_i - Q_i | / \sum min(P_i , Q_i)
Canberra :
d = \sum | P_i - Q_i | / (P_i + Q_i)
Lorentzian :
d = \sum ln(1 + | P_i - Q_i |)
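A minimal base R sketch of two members of this family, using the same example vectors as above:

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
sum(abs(P - Q)) / sum(P + Q)      # Sorensen
sum(abs(P - Q) / (P + Q))         # Canberra (element-wise ratio, then summed)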
Intersection family
Intersection :
s = \sum min(P_i , Q_i)
Non-Intersection :
d = 1 - \sum min(P_i , Q_i)
Wave Hedges :
d = \sum | P_i - Q_i | / max(P_i , Q_i)
Czekanowski :
d = \sum | P_i - Q_i | / \sum | P_i + Q_i |
Motyka :
d = \sum min(P_i , Q_i) / (P_i + Q_i)
Kulczynski s :
d = 1 / ( \sum | P_i - Q_i | / \sum min(P_i , Q_i) )
Tanimoto :
d = \sum (max(P_i , Q_i) - min(P_i , Q_i)) / \sum max(P_i , Q_i)
; equivalent to Soergel
Ruzicka :
s = \sum min(P_i , Q_i) / \sum max(P_i , Q_i)
; equivalent to 1 - Tanimoto = 1 - Soergel
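The intersection-based measures reduce to element-wise minima and maxima; a short base R sketch of the definitions and equivalences above:

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
sum(pmin(P, Q))                     # Intersection similarity
1 - sum(pmin(P, Q))                 # Non-Intersection distance
sum(abs(P - Q)) / sum(pmax(P, Q))   # Tanimoto, equal to Soergel
sum(pmin(P, Q)) / sum(pmax(P, Q))   # Ruzicka, equal to 1 - Tanimoto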
Inner Product family
Inner Product :
s = \sum P_i * Q_i
Harmonic mean :
s = 2 * \sum (P_i * Q_i) / (P_i + Q_i)
Cosine :
s = \sum (P_i * Q_i) / ( sqrt(\sum P_i^2) * sqrt(\sum Q_i^2) )
Kumar-Hassebrook (PCE) :
s = \sum (P_i * Q_i) / (\sum P_i^2 + \sum Q_i^2 - \sum (P_i * Q_i))
Jaccard :
d = 1 - \sum (P_i * Q_i) / (\sum P_i^2 + \sum Q_i^2 - \sum (P_i * Q_i))
; equivalent to 1 - Kumar-Hassebrook
Dice :
d = \sum (P_i - Q_i)^2 / (\sum P_i^2 + \sum Q_i^2)
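A base R sketch of the cosine and Jaccard definitions above:

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
sum(P * Q) / (sqrt(sum(P^2)) * sqrt(sum(Q^2)))           # Cosine similarity
PCE <- sum(P * Q) / (sum(P^2) + sum(Q^2) - sum(P * Q))   # Kumar-Hassebrook (PCE)
1 - PCE                                                  # Jaccard distance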
Squared-chord family
Fidelity :
s = \sum sqrt(P_i * Q_i)
Bhattacharyya :
d = - ln \sum sqrt(P_i * Q_i)
Hellinger :
d = 2 * sqrt( 1 - \sum sqrt(P_i * Q_i))
Matusita :
d = sqrt( 2 - 2 * \sum sqrt(P_i * Q_i))
Squared-chord :
d = \sum ( sqrt(P_i) - sqrt(Q_i) )^2
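A base R sketch relating the Bhattacharyya coefficient to the Hellinger and squared-chord definitions above:

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
BC <- sum(sqrt(P * Q))       # Fidelity (Bhattacharyya coefficient)
-log(BC)                     # Bhattacharyya distance
2 * sqrt(1 - BC)             # Hellinger distance (as defined above)
sum((sqrt(P) - sqrt(Q))^2)   # Squared-chord distance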
Squared L_2 family (X^2 squared family)
Squared Euclidean :
d = \sum ( P_i - Q_i )^2
Pearson X^2 :
d = \sum ( (P_i - Q_i )^2 / Q_i )
Neyman X^2 :
d = \sum ( (P_i - Q_i )^2 / P_i )
Squared X^2 :
d = \sum ( (P_i - Q_i )^2 / (P_i + Q_i) )
Probabilistic Symmetric X^2 :
d = 2 * \sum ( (P_i - Q_i )^2 / (P_i + Q_i) )
Divergence :
d = 2 * \sum ( (P_i - Q_i )^2 / (P_i + Q_i)^2 )
Clark :
d = sqrt ( \sum (| P_i - Q_i | / (P_i + Q_i))^2 )
Additive Symmetric X^2 :
d = \sum ( ((P_i - Q_i)^2 * (P_i + Q_i)) / (P_i * Q_i) )
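A base R sketch of three of the X^2-type definitions above:

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
sum((P - Q)^2)        # Squared Euclidean
sum((P - Q)^2 / Q)    # Pearson X^2
sum((P - Q)^2 / P)    # Neyman X^2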
Shannon's entropy family
Kullback-Leibler :
d = \sum P_i * log(P_i / Q_i)
Jeffreys :
d = \sum (P_i - Q_i) * log(P_i / Q_i)
K divergence :
d = \sum P_i * log(2 * P_i / (P_i + Q_i))
Topsoe :
d = \sum ( P_i * log(2 * P_i / (P_i + Q_i)) + Q_i * log(2 * Q_i / (P_i + Q_i)) )
Jensen-Shannon :
d = 0.5 * ( \sum P_i * log(2 * P_i / (P_i + Q_i)) + \sum Q_i * log(2 * Q_i / (P_i + Q_i)) )
Jensen difference :
d = \sum ( (P_i * log(P_i) + Q_i * log(Q_i)) / 2 - ((P_i + Q_i) / 2) * log((P_i + Q_i) / 2) )
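For vectors without zero entries, these divergences can be reproduced in base R and compared with distance(); unit = "log" (the default shown in the Examples) is taken here to correspond to the natural logarithm:

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
sum(P * log(P / Q))                                         # Kullback-Leibler
M <- P + Q
0.5 * (sum(P * log(2 * P / M)) + sum(Q * log(2 * Q / M)))   # Jensen-Shannon
philentropy::distance(rbind(P, Q), method = "kullback-leibler", unit = "log")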
Combinations
Taneja :
d = \sum ( ((P_i + Q_i) / 2) * log( (P_i + Q_i) / ( 2 * sqrt(P_i * Q_i)) ) )
Kumar-Johnson :
d = \sum ( (P_i^2 - Q_i^2)^2 / (2 * (P_i * Q_i)^1.5) )
Avg(L_1, L_n) :
d = ( \sum | P_i - Q_i | + max | P_i - Q_i | ) / 2
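A base R sketch of the Avg(L_1, L_n) combination above:

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
(sum(abs(P - Q)) + max(abs(P - Q))) / 2   # Avg(L_1, L_n)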
In cases where x specifies a count matrix, the argument est.prob can be used to first estimate probability vectors from the input count vectors and then compute the corresponding distance measure based on the estimated probability vectors.
The following probability estimation method is implemented in this function:
- est.prob = "empirical" : relative frequencies of counts.
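As a sketch of what est.prob = "empirical" does, the same result should be obtained by normalising each count vector by its row sum before calling distance():

CountMatrix <- rbind(1:10, 20:29, 30:39)
ProbMatrix  <- CountMatrix / rowSums(CountMatrix)   # empirical relative frequencies
philentropy::distance(CountMatrix, method = "euclidean", est.prob = "empirical")
philentropy::distance(ProbMatrix,  method = "euclidean")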
Value
The following results are returned depending on the dimension of x:
- in case nrow(x) = 2 : a single distance value.
- in case nrow(x) > 2 : a distance matrix storing distance values for all pairwise probability vector comparisons.
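A minimal illustration of the two return types:

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
R <- 30:39 / sum(30:39)
philentropy::distance(rbind(P, Q), method = "euclidean")      # nrow(x) = 2 : single value
philentropy::distance(rbind(P, Q, R), method = "euclidean")   # nrow(x) > 2 : distance matrix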
Note
According to the reference, invalid computations can occur in some distance measure computations when dealing with 0 probabilities. In these cases the following conventions are applied:
- division by zero (case 0/0) : when both divisor and dividend become zero, 0/0 is treated as 0.
- division by zero (case n/0) : when only the divisor becomes 0, the corresponding 0 is replaced by a small \epsilon = 0.00001.
- log of zero (case 0 * log(0)) : treated as 0.
- log of zero (case log(0)) : zero is replaced by a small \epsilon = 0.00001.
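To see where these conventions matter, the hedged sketch below compares Kullback-Leibler values when the second vector contains a zero; the exact numbers depend on the epsilon replacement described above:

P <- c(0.2, 0.3, 0.5)
Q <- c(0.0, 0.3, 0.7)
# P_1 / Q_1 triggers the n/0 case; the zero divisor is replaced by epsilon,
# so the resulting divergence changes with the chosen epsilon value
philentropy::distance(rbind(P, Q), method = "kullback-leibler", epsilon = 1e-05)
philentropy::distance(rbind(P, Q), method = "kullback-leibler", epsilon = 1e-09)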
Author(s)
Hajk-Georg Drost
References
Sung-Hyuk Cha. (2007). Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences 4: 1.
See Also
getDistMethods
, estimate.probability
, dist.diversity
Examples
# Simple Examples
# retrieve a list of implemented probability distance measures
getDistMethods()
## compute the euclidean distance between two probability vectors
distance(rbind(1:10/sum(1:10), 20:29/sum(20:29)), method = "euclidean")
## compute the euclidean distance between all pairwise comparisons of probability vectors
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39))
distance(ProbMatrix, method = "euclidean")
# compute distance matrix without testing for NA values in the input matrix
distance(ProbMatrix, method = "euclidean", test.na = FALSE)
# alternatively use the colnames of the input data for the rownames and colnames
# of the output distance matrix
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39))
rownames(ProbMatrix) <- paste0("Example", 1:3)
distance(ProbMatrix, method = "euclidean", use.row.names = TRUE)
# Specialized Examples
CountMatrix <- rbind(1:10, 20:29, 30:39)
## estimate probabilities from a count matrix
distance(CountMatrix, method = "euclidean", est.prob = "empirical")
## compute the euclidean distance for count data
## NOTE: some distance measures are only defined for probability values;
## for those measures, convert counts to probabilities first (e.g. est.prob = "empirical")
distance(CountMatrix, method = "euclidean")
## compute the Kullback-Leibler Divergence with different logarithm bases:
### case: unit = log (Default)
distance(ProbMatrix, method = "kullback-leibler", unit = "log")
### case: unit = log2
distance(ProbMatrix, method = "kullback-leibler", unit = "log2")
### case: unit = log10
distance(ProbMatrix, method = "kullback-leibler", unit = "log10")