| dist.matrix {wordspace} | R Documentation |
Distances/Similarities between Row or Column Vectors (wordspace)
Description
Compute a symmetric matrix of distances (or similarities) between the rows or columns of a matrix;
or compute cross-distances between the rows or columns of two different matrices.
This implementation is faster than dist and can operate on sparse matrices (in canonical DSM format).
Usage
dist.matrix(M, M2 = NULL, method = "cosine", p = 2,
normalized = FALSE, byrow = TRUE, convert = TRUE, as.dist = FALSE,
terms = NULL, terms2 = terms, skip.missing = FALSE)
Arguments
M
a dense or sparse matrix representing a scored DSM, or an object of class dsm
M2
an optional dense or sparse matrix representing a second scored DSM, or an object of class dsm. If present, cross-distances between the rows (or columns) of M and those of M2 are computed.
method
distance or similarity measure to be used (see “Distance Measures” below for details)
p
exponent of the minkowski p-metric, a numeric value in the range 0 ≤ p < ∞ (p=Inf selects the maximum distance)
normalized
if TRUE, assume that the vectors have already been normalized appropriately for the chosen measure, allowing a more efficient computation (see the individual measures below for details)
byrow
whether to calculate distances between row vectors (default) or between column vectors (byrow=FALSE)
convert
if TRUE (the default), similarity values are automatically converted to distances or dissimilarities (see “Distance Measures” below for details)
as.dist
convert the full symmetric distance matrix to a compact object of class dist (not possible for cross-distances or an asymmetric measure)
terms
a character vector specifying rows of M for which the distance matrix is to be computed (or columns if byrow=FALSE)
terms2
a character vector specifying rows of M2 for cross-distances (or columns if byrow=FALSE); defaults to the same set of terms as terms
skip.missing
if TRUE, terms not found in the matrix are silently dropped; the default (FALSE) is to abort with an error message
Value
By default, a numeric matrix of class dist.matrix, specifying distances or similarities between term vectors.
A similarity matrix is marked by an additional attribute similarity with value TRUE.
If the distance or similarity matrix is symmetric (i.e. neither a cross-distance matrix nor based on an asymmetric distance measure), it is marked by an attribute symmetric with value TRUE.
If as.dist=TRUE, the matrix is compacted to an object of class dist.
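The marker attributes described above can be inspected directly. A minimal sketch, assuming the wordspace package is installed (DSM_TermTermMatrix is the small example matrix shipped with it):

```r
library(wordspace)  # assumed to be installed

M <- DSM_TermTermMatrix  # small example DSM shipped with wordspace

# with convert=FALSE, cosine values are similarities, so the result
# carries the attribute similarity = TRUE; it is also marked symmetric
sim <- dist.matrix(M, method = "cosine", convert = FALSE)
attr(sim, "similarity")
attr(sim, "symmetric")

# as.dist=TRUE compacts the symmetric matrix to a "dist" object
d <- dist.matrix(M, as.dist = TRUE)
class(d)
```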
Distance Measures
Given two DSM vectors x and y, the following distance metrics can be computed:
euclidean
The Euclidean distance given by
d_2(x, y) = \sqrt{ \sum_i (x_i - y_i)^2 }

manhattan
The Manhattan (or “city block”) distance given by
d_1(x, y) = \sum_i |x_i - y_i|

maximum
The maximum distance given by
d_{\infty}(x, y) = \max_i |x_i - y_i|

minkowski
The Minkowski distance is a family of metrics determined by a parameter 0 \le p < \infty, which encompasses the Euclidean, Manhattan and maximum distances as special cases. Also known as the L_p metric, it is defined by
d_p(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}
for p \ge 1 and by
d_p(x, y) = \sum_i |x_i - y_i|^p
for 0 \le p < 1. In the latter case, it is not homogeneous and cannot be derived from a corresponding mathematical norm (cf. rowNorms).
Special cases include the Euclidean metric d_2(x, y) for p = 2 and the Manhattan metric d_1(x, y) for p = 1, but the dedicated methods above provide more efficient implementations. For p \to \infty, d_p(x, y) converges to the maximum distance d_{\infty}(x, y), which can also be selected by setting p=Inf. For p = 0, d_p(x, y) corresponds to the Hamming distance, i.e. the number of differing coordinates
d_0(x, y) = \#\{ i \mid x_i \ne y_i \}

canberra
The Canberra metric has been implemented for compatibility with the dist function, even though it is probably not very useful for DSM vectors. It is given by
\sum_i \frac{|x_i - y_i|}{|x_i| + |y_i|}
(see https://en.wikipedia.org/wiki/Canberra_distance). Terms with x_i = y_i = 0 are silently dropped from the summation.
Note that dist uses a different formula
\sum_i \frac{|x_i - y_i|}{|x_i + y_i|}
which is highly problematic unless x and y are guaranteed to be non-negative. Terms with x_i = y_i = 0 are imputed, i.e. set to the average value of all nonzero terms.
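The Minkowski family and its special cases can be checked with a minimal base-R sketch (minkowski_dist is a hypothetical helper for illustration, not part of wordspace):

```r
# Minkowski p-distance following the definitions above:
# (sum |x_i - y_i|^p)^(1/p) for p >= 1, sum |x_i - y_i|^p for 0 <= p < 1,
# with p = Inf (maximum) and p = 0 (Hamming) as limiting cases
minkowski_dist <- function(x, y, p) {
  d <- abs(x - y)
  if (p == Inf) return(max(d))      # maximum distance d_Inf
  if (p == 0)   return(sum(d != 0)) # Hamming distance d_0
  if (p >= 1) sum(d^p)^(1/p) else sum(d^p)
}

x <- c(1, 0, 2); y <- c(0, 3, 2)   # |x - y| = (1, 3, 0)
minkowski_dist(x, y, 2)    # Euclidean: sqrt(1 + 9 + 0) = sqrt(10)
minkowski_dist(x, y, 1)    # Manhattan: 1 + 3 + 0 = 4
minkowski_dist(x, y, Inf)  # maximum: 3
minkowski_dist(x, y, 0)    # Hamming: 2 coordinates differ
```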
In addition, the following similarity measures can be computed and optionally converted to a distance metric (or dissimilarity):
cosine (default)
The cosine similarity given by
\cos \phi = \frac{x^T y}{||x||_2 \cdot ||y||_2}
If normalized=TRUE, the denominator is omitted. If convert=TRUE (the default), the cosine similarity is converted to the angular distance \phi, given in degrees ranging from 0 to 180.

jaccard
The generalized Jaccard coefficient given by
J(x, y) = \frac{ \sum_i \min(x_i, y_i) }{ \sum_i \max(x_i, y_i) }
which is only defined for non-negative vectors x and y. If convert=TRUE (the default), the Jaccard metric 1 - J(x, y) is returned (see Kosub 2016 for details). Note that J(0, 0) = 1.

overlap
An asymmetric measure of overlap given by
o(x, y) = \frac{ \sum_i \min(x_i, y_i) }{ \sum_i x_i }
for non-negative vectors x and y. If convert=TRUE (the default), the result is converted into the dissimilarity measure 1 - o(x, y), which is not a metric, of course. Note that o(0, y) = 1 and in particular o(0, 0) = 1.
Overlap computes the proportion of the “mass” of x that is shared with y; as a consequence, o(x, y) = 1 whenever x \le y. If both vectors are normalized as probability distributions (||x||_1 = ||y||_1 = 1), then overlap is symmetric (o(x, y) = o(y, x)) and can be thought of as the shared probability mass of the two distributions. In this case, normalized=TRUE can be passed in order to simplify the computation to o(x, y) = \sum_i \min(x_i, y_i).
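The similarity measures and the convert=TRUE conversions can be sketched in base R (cosine_sim, angular_dist, jaccard_sim and overlap_sim are hypothetical helpers for illustration, not wordspace functions):

```r
# cosine similarity and its conversion to angular distance in degrees
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
angular_dist <- function(x, y) {
  s <- max(-1, min(1, cosine_sim(x, y)))  # clamp rounding noise for acos()
  acos(s) / pi * 180                      # angle phi in degrees, 0..180
}

# generalized Jaccard coefficient and asymmetric overlap (non-negative x, y)
jaccard_sim <- function(x, y) sum(pmin(x, y)) / sum(pmax(x, y))
overlap_sim <- function(x, y) sum(pmin(x, y)) / sum(x)

x <- c(2, 0, 1); y <- c(1, 1, 1)
angular_dist(c(1, 0), c(0, 1))  # orthogonal vectors: 90 degrees
jaccard_sim(x, y)               # (1 + 0 + 1) / (2 + 1 + 1) = 0.5
1 - overlap_sim(x, y)           # dissimilarity returned when convert=TRUE
```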
Author(s)
Stephanie Evert (https://purl.org/stephanie.evert)
See Also
plot and head methods for distance matrices; nearest.neighbours and pair.distances also accept a precomputed dist.matrix object instead of a DSM matrix M
rowNorms for length normalization of DSM vectors, which is highly recommended for most distance metrics (and implicit in cosine)
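Why normalization is implicit in cosine can be seen in a small base-R sketch: after scaling each vector to unit Euclidean length (what rowNorms-based normalization achieves row-wise), the dot product alone equals the cosine similarity, which is the shortcut taken by normalized=TRUE.

```r
# length-normalize two vectors and compare the full cosine formula
# against the plain dot product of the normalized vectors
x <- c(3, 4); y <- c(1, 2)
xn <- x / sqrt(sum(x^2))  # unit Euclidean length
yn <- y / sqrt(sum(y^2))

full  <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
short <- sum(xn * yn)     # what normalized=TRUE computes for cosine
all.equal(full, short)    # TRUE
```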
Examples
M <- DSM_TermTermMatrix
dist.matrix(M, as.dist=TRUE) # angular distance
dist.matrix(M, method="euclidean", as.dist=TRUE) # Euclidean distance
dist.matrix(M, method="manhattan", as.dist=TRUE) # Manhattan distance
dist.matrix(M, method="minkowski", p=1, as.dist=TRUE) # L_1 distance
dist.matrix(M, method="minkowski", p=99, as.dist=TRUE) # almost L_Inf
dist.matrix(M, method="maximum", as.dist=TRUE) # L_Inf (maximum)
dist.matrix(M, method="minkowski", p=.5, as.dist=TRUE) # L_0.5 distance
dist.matrix(M, method="minkowski", p=0, as.dist=TRUE) # Hamming distance
round(dist.matrix(M, method="cosine", convert=FALSE), 3) # cosine similarity