metrics {EnvNJ} R Documentation

Pairwise Vector Dissimilarities

Description

Computes the dissimilarity between n-dimensional vectors.

Usage

metrics(vset, method = 'euclidean', p = 2)

Arguments

 vset    matrix (n x m) where each column is an n-dimensional vector.

 method  a character string indicating the distance/dissimilarity method to be used (see Details).

 p       power of the Minkowski distance. This parameter is only relevant when method = 'minkowski'.

Details

Although many of the offered methods compute a proper distance, that is not always the case. For instance, for a nonzero vector v, the 'cosine' method gives d(v, 2v) = 0 even though v and 2v differ, violating the coincidence axiom. For that reason we prefer the term dissimilarity to distance. The methods offered can be grouped into the following families.
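
The d(v, 2v) = 0 claim can be checked numerically. The sketch below is base R only and does not call EnvNJ; the helper name cosine_diss is hypothetical, and it assumes the 'cosine' dissimilarity is -ln(0.5 (1 + cos)), where cos is the usual cosine similarity of the two vectors.

```r
# Hypothetical helper, not the EnvNJ implementation.
# Assumes: cosine dissimilarity = -ln(0.5 * (1 + cosine similarity)).
cosine_diss <- function(p, q) {
  cos_sim <- sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))
  -log(0.5 * (1 + cos_sim))
}

v <- c(1, 2, 3)
d_same_direction <- cosine_diss(v, 2 * v)        # distinct vectors, same direction
d_orthogonal     <- cosine_diss(c(1, 0), c(0, 1)) # orthogonal vectors

print(d_same_direction)  # zero up to floating-point rounding
print(d_orthogonal)      # equals log(2)
```

Two different vectors at dissimilarity zero is exactly what rules 'cosine' out as a proper distance.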

L_p family:

('euclidean', 'manhattan', 'minkowski', 'chebyshev')

Euclidean = sqrt( sum | P_i - Q_i |^2)

Manhattan = sum | P_i - Q_i |

Minkowski = ( sum | P_i - Q_i |^p )^(1/p)

Chebyshev = max | P_i - Q_i |
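
As an illustration of the four formulas above, here is a minimal base-R sketch for a single pair of vectors. The function lp_diss and its argument names are hypothetical and are not part of EnvNJ.

```r
# Hypothetical helper (not the EnvNJ implementation): L_p family for two
# numeric vectors of equal length.
lp_diss <- function(p, q, method = "euclidean", power = 2) {
  d <- abs(p - q)
  switch(method,
    euclidean = sqrt(sum(d^2)),
    manhattan = sum(d),
    minkowski = sum(d^power)^(1 / power),
    chebyshev = max(d),
    stop("unknown method: ", method)  # default branch for unmatched names
  )
}

P <- c(1, 2, 3)
Q <- c(4, 6, 3)
lp_diss(P, Q, "euclidean")             # 5
lp_diss(P, Q, "manhattan")             # 7
lp_diss(P, Q, "chebyshev")             # 4
lp_diss(P, Q, "minkowski", power = 1)  # 7, same as manhattan
```

Minkowski reduces to Manhattan for power = 1 and to Euclidean for power = 2, which is why the p argument only matters for 'minkowski'.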

L_1 family:

('sorensen', 'soergel', 'lorentzian', 'kulczynski', 'canberra')

Sorensen = sum | P_i - Q_i | / sum (P_i + Q_i)

Soergel = sum | P_i - Q_i | / sum max(P_i , Q_i)

Lorentzian = sum ln(1 + | P_i - Q_i |)

Kulczynski = sum | P_i - Q_i | / sum min(P_i , Q_i)

Canberra = sum ( | P_i - Q_i | / (P_i + Q_i) )
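
The L_1 family can be sketched in base R as follows. The helper l1_family is hypothetical, not EnvNJ code. It assumes non-negative inputs and reads the Canberra sum as taken over the per-coordinate ratios.

```r
# Hypothetical helper: L_1 family for two non-negative vectors.
l1_family <- function(p, q) {
  d <- abs(p - q)
  c(
    sorensen   = sum(d) / sum(p + q),
    soergel    = sum(d) / sum(pmax(p, q)),
    lorentzian = sum(log(1 + d)),
    kulczynski = sum(d) / sum(pmin(p, q)),
    canberra   = sum(d / (p + q))   # assumption: ratio taken per coordinate
  )
}

P <- c(0.2, 0.3, 0.5)
Q <- c(0.1, 0.4, 0.5)
res_l1 <- l1_family(P, Q)
print(res_l1)
```

Coordinates where both vectors are zero produce 0/0 in the Canberra term, so real data may need those coordinates filtered out first.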

Intersection family:

('non-intersection', 'wavehedges', 'czekanowski', 'motyka')

Non-intersection = 1 - sum min(P_i , Q_i)

Wave-Hedges = sum ( | P_i - Q_i | / max(P_i , Q_i) )

Czekanowski = sum | P_i - Q_i | / sum | P_i + Q_i |

Motyka = sum max(P_i , Q_i) / sum (P_i + Q_i)
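
A base-R sketch of the intersection family follows. The helper intersection_family is hypothetical. It assumes non-negative inputs, takes the Wave-Hedges ratio per coordinate, and uses sum (P_i + Q_i) as the Motyka denominator.

```r
# Hypothetical helper: intersection family for two non-negative vectors.
intersection_family <- function(p, q) {
  d <- abs(p - q)
  c(
    non_intersection = 1 - sum(pmin(p, q)),
    wavehedges       = sum(d / pmax(p, q)),         # assumption: per-coordinate ratio
    czekanowski      = sum(d) / sum(abs(p + q)),
    motyka           = sum(pmax(p, q)) / sum(p + q) # assumption: denominator is sum(P + Q)
  )
}

P <- c(0.2, 0.3, 0.5)
Q <- c(0.1, 0.4, 0.5)
res_int <- intersection_family(P, Q)
print(res_int)
```

When both vectors sum to 1, the non-intersection value equals half the Manhattan distance, as it does in this example.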

Inner product family:

('cosine', 'jaccard')

Cosine = - ln( 0.5 (1 + sum (P_i Q_i) / ( sqrt(sum P_i^2) sqrt(sum Q_i^2) )) )

Jaccard = 1 - sum (P_i Q_i) / (sum P_i^2 + sum Q_i^2 - sum (P_i Q_i))

Squared-chord family:

('bhattacharyya', 'squared_chord')

Bhattacharyya = - ln sum sqrt(P_i Q_i)

Squared-chord = sum ( sqrt(P_i) - sqrt(Q_i) )^2
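
The two squared-chord family measures are closely related, which a short base-R sketch makes visible. The function names are hypothetical, and the inputs are assumed to be non-negative vectors that each sum to 1.

```r
# Hypothetical helpers, not EnvNJ code.
bhattacharyya <- function(p, q) -log(sum(sqrt(p * q)))
squared_chord <- function(p, q) sum((sqrt(p) - sqrt(q))^2)

P <- c(0.2, 0.3, 0.5)
Q <- c(0.1, 0.4, 0.5)

bc <- sum(sqrt(P * Q))       # Bhattacharyya coefficient
sc <- squared_chord(P, Q)
bh <- bhattacharyya(P, Q)

# For vectors summing to 1, expanding the square gives sc = 2 - 2 * bc.
print(c(squared_chord = sc, two_minus_2bc = 2 - 2 * bc, bhattacharyya = bh))
```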

Squared Chi family:

('squared_chi')

Squared-Chi = sum ( (P_i - Q_i )^2 / (P_i + Q_i) )

Shannon's entropy family:

('kullback-leibler', 'jeffreys', 'jensen-shannon', 'jensen_difference')

Kullback-Leibler = sum P_i * log(P_i / Q_i)

Jeffreys = sum (P_i - Q_i) * log(P_i / Q_i)

Jensen-Shannon = 0.5(sum P_i ln(2P_i / (P_i + Q_i)) + sum Q_i ln(2Q_i / (P_i + Q_i)))

Jensen difference = sum ( 0.5 (P_i ln(P_i) + Q_i ln(Q_i)) - 0.5 (P_i + Q_i) ln(0.5 (P_i + Q_i)) )
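
The entropy-based measures can be sketched in base R as below. The function names are hypothetical. The sketch assumes that log and ln both denote the natural logarithm (as R's log() does) and that all entries are strictly positive, since zeros would produce NaN or Inf.

```r
# Hypothetical helpers, not EnvNJ code. Assumes natural logarithms and
# strictly positive entries.
kl <- function(p, q) sum(p * log(p / q))
jeffreys <- function(p, q) sum((p - q) * log(p / q))
jensen_shannon <- function(p, q) {
  s <- p + q
  0.5 * (sum(p * log(2 * p / s)) + sum(q * log(2 * q / s)))
}
jensen_difference <- function(p, q) {
  m <- 0.5 * (p + q)
  sum(0.5 * (p * log(p) + q * log(q)) - m * log(m))
}

P <- c(0.2, 0.3, 0.5)
Q <- c(0.1, 0.4, 0.5)

# Kullback-Leibler is asymmetric; Jeffreys is its symmetrised sum.
print(c(kl_pq = kl(P, Q), kl_qp = kl(Q, P), jeffreys = jeffreys(P, Q)))
print(c(js = jensen_shannon(P, Q), jd = jensen_difference(P, Q)))
```

Under these assumptions Jeffreys equals KL(P, Q) + KL(Q, P), and the Jensen-Shannon and Jensen difference expressions are algebraically the same quantity.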

Mismatch family:

('hamming', 'mismatch', 'mismatchZero', 'binary')

Hamming = (# coordinates where P_i != Q_i) / n

Mismatch = # coordinates where P_i != Q_i

MismatchZero = Same as mismatch but after removing the coordinates where both vectors have zero.

Binary = (# coordinates where one vector has 0 and the other has a non-zero value) / n.
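
Three of these counting measures are sketched below in base R, read literally from the definitions above. The function names are hypothetical and this is not the EnvNJ implementation.

```r
# Hypothetical helpers, not EnvNJ code.
mismatch <- function(p, q) sum(p != q)                  # raw count of differing coordinates
hamming  <- function(p, q) sum(p != q) / length(p)      # count divided by n
binary   <- function(p, q) sum(xor(p == 0, q == 0)) / length(p)  # exactly one is zero

P <- c(0, 0, 1, 2, 0)
Q <- c(0, 1, 1, 3, 0)
mismatch(P, Q)  # 2 (coordinates 2 and 4 differ)
hamming(P, Q)   # 0.4
binary(P, Q)    # 0.2 (only coordinate 2 pairs a zero with a non-zero)
```

Coordinate 4 counts as a mismatch but not towards 'binary', because both values there are non-zero.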

Combinations family:

('taneja', 'kumar-johnson', 'avg')

Taneja = sum ( ((P_i + Q_i) / 2) log( (P_i + Q_i) / (2 sqrt(P_i Q_i)) ) )

Kumar-Johnson = sum ( (P_i^2 - Q_i^2)^2 / (2 (P_i Q_i)^1.5) )

Avg = 0.5 (sum | P_i - Q_i| + max | P_i - Q_i |)
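
A base-R sketch of the combinations family follows. The function names are hypothetical. Taneja is read here as the arithmetic mean of each coordinate pair times the log of the ratio of arithmetic to geometric mean, and Kumar-Johnson as a sum of per-coordinate ratios. Both assume strictly positive entries and natural logarithms.

```r
# Hypothetical helpers, not EnvNJ code. Assumes strictly positive entries.
taneja <- function(p, q) {
  m <- (p + q) / 2                  # arithmetic mean per coordinate
  sum(m * log(m / sqrt(p * q)))     # log of arithmetic / geometric mean
}
kumar_johnson <- function(p, q) sum((p^2 - q^2)^2 / (2 * (p * q)^1.5))
avg_diss <- function(p, q) {
  d <- abs(p - q)
  0.5 * (sum(d) + max(d))           # mean of Manhattan and Chebyshev
}

P <- c(0.2, 0.3, 0.5)
Q <- c(0.1, 0.4, 0.5)
print(c(taneja = taneja(P, Q), kumar_johnson = kumar_johnson(P, Q)))
avg_diss(c(1, 2, 3), c(4, 6, 3))    # 0.5 * (7 + 4) = 5.5
```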

Value

A matrix with the computed dissimilarity values.
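
To make the shape of such a result concrete, the sketch below builds a pairwise dissimilarity matrix over the columns of vset in base R. It is an assumption that the returned matrix is m x m with one row and one column per input vector. The helper pairwise_diss and its argument f are hypothetical and do not reproduce the EnvNJ code.

```r
# Hypothetical helper: apply a two-vector dissimilarity f to every pair of
# columns of vset. Assumes an m x m result, one row/column per vector.
pairwise_diss <- function(vset, f) {
  m <- ncol(vset)
  out <- matrix(0, m, m, dimnames = list(colnames(vset), colnames(vset)))
  for (i in seq_len(m)) {
    for (j in seq_len(m)) {
      out[i, j] <- f(vset[, i], vset[, j])
    }
  }
  out
}

euclid <- function(p, q) sqrt(sum((p - q)^2))

# Three 2-dimensional vectors stored as columns.
vset <- cbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))
D <- pairwise_diss(vset, euclid)
print(D)
```

For a symmetric measure such as the Euclidean distance, only the upper triangle needs computing; the double loop is kept for clarity.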

References

Sung-Hyuk Cha (2007). International Journal of Mathematical Models and Methods in Applied Sciences. Issue 4, vol. 1

Luczak et al. (2019). Briefings in Bioinformatics 20: 1222-1237.