dist.matrix {wordspace}	R Documentation
Distances/Similarities between Row or Column Vectors
Description
Compute a symmetric matrix of distances (or similarities) between the rows or columns of a matrix;
or compute cross-distances between the rows or columns of two different matrices.
This implementation is faster than dist
and can operate on sparse matrices (in canonical DSM format).
Usage
dist.matrix(M, M2 = NULL, method = "cosine", p = 2,
normalized = FALSE, byrow = TRUE, convert = TRUE, as.dist = FALSE,
terms = NULL, terms2 = terms, skip.missing = FALSE)
Arguments
M
a dense or sparse matrix representing a scored DSM, or an object of class dsm

M2
an optional dense or sparse matrix representing a second scored DSM, or an object of class dsm; if specified, cross-distances between the rows (or columns) of M and those of M2 are computed

method
distance or similarity measure to be used (see “Distance Measures” below for details)

p
exponent of the minkowski p-metric, a numeric value in the range 0 \le p < \infty (p=Inf is also allowed and selects the maximum distance)

normalized
if TRUE, assume that the vectors in M (and M2) have already been normalized, allowing some measures (in particular cosine) to be computed more efficiently; the vectors are not checked, so results will be invalid if the assumption does not hold

byrow
whether to calculate distances between row vectors (default) or between column vectors (byrow=FALSE)

convert
if TRUE (the default), similarity values are automatically converted to distances or dissimilarities; see “Distance Measures” below for details

as.dist
convert the full symmetric distance matrix to a compact object of class dist (only meaningful for a symmetric distance matrix, i.e. not for cross-distances or an asymmetric measure)

terms
a character vector specifying rows of M (or columns if byrow=FALSE) for which distances are to be computed

terms2
a character vector specifying rows (or columns) of M2 for the computation of cross-distances; defaults to the same set of terms

skip.missing
if TRUE, silently ignore entries of terms and terms2 that are not found in the corresponding matrix; by default (FALSE), missing terms raise an error
Value
By default, a numeric matrix of class dist.matrix, specifying distances or similarities between term vectors. A similarity matrix is marked by an additional attribute similarity with value TRUE.
If the distance or similarity matrix is symmetric (i.e. neither a cross-distance matrix nor based on an asymmetric distance measure), it is marked by an attribute symmetric with value TRUE.
If as.dist=TRUE, the matrix is compacted to an object of class dist.
Distance Measures
Given two DSM vectors x and y, the following distance metrics can be computed:
euclidean
The Euclidean distance given by
d_2(x, y) = \sqrt{ \sum_i (x_i - y_i)^2 }
manhattan
The Manhattan (or “city block”) distance given by
d_1(x, y) = \sum_i |x_i - y_i|
maximum
The maximum distance given by
d_{\infty}(x, y) = \max_i |x_i - y_i|
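As a sanity check, the three metrics above are easily computed by hand in base R (a small illustration that is independent of the wordspace package):

```r
x <- c(1, 0, 3)
y <- c(2, 2, 1)

d2   <- sqrt(sum((x - y)^2))  # Euclidean:  sqrt(1 + 4 + 4) = 3
d1   <- sum(abs(x - y))       # Manhattan:  1 + 2 + 2 = 5
dInf <- max(abs(x - y))       # maximum:    max(1, 2, 2) = 2
```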
minkowski
The Minkowski distance is a family of metrics determined by a parameter 0 \le p < \infty, which encompasses the Euclidean, Manhattan and maximum distance as special cases. Also known as the L_p-metric, it is defined by
d_p(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}
for p \ge 1 and by
d_p(x, y) = \sum_i |x_i - y_i|^p
for 0 \le p < 1. In the latter case, it is not homogeneous and cannot be derived from a corresponding mathematical norm (cf. rowNorms).
Special cases include the Euclidean metric d_2(x, y) for p = 2 and the Manhattan metric d_1(x, y) for p = 1, but the dedicated methods above provide more efficient implementations. For p \to \infty, d_p(x, y) converges to the maximum distance d_{\infty}(x, y), which is also selected by setting p=Inf. For p = 0, d_p(x, y) corresponds to the Hamming distance, i.e. the number of differing coordinates:
d_0(x, y) = \#\{ i \mid x_i \ne y_i \}
canberra
The Canberra metric has been implemented for compatibility with the dist function, even though it is probably not very useful for DSM vectors. It is given by
\sum_i \frac{|x_i - y_i|}{|x_i| + |y_i|}
(see https://en.wikipedia.org/wiki/Canberra_distance). Terms with x_i = y_i = 0 are silently dropped from the summation.
Note that dist uses a different formula
\sum_i \frac{|x_i - y_i|}{|x_i + y_i|}
which is highly problematic unless x and y are guaranteed to be non-negative. Terms with x_i = y_i = 0 are imputed, i.e. set to the average value of all nonzero terms.
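The difference between the two denominators shows up on a vector pair with a negative component and a shared zero (plain base R, only meant to illustrate the formulas):

```r
x <- c(1, -2, 0)
y <- c(3,  2, 0)

# wordspace variant: denominator |x_i| + |y_i|; the 0/0 term is dropped
terms    <- abs(x - y) / (abs(x) + abs(y))  # 0.5, 1, NaN
canberra <- sum(terms[is.finite(terms)])    # 1.5

# dist() variant: denominator |x_i + y_i| explodes when x_i = -y_i
abs(x - y) / abs(x + y)                     # 0.5, Inf, NaN
```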
In addition, the following similarity measures can be computed and optionally converted to a distance metric (or dissimilarity):
cosine (default)
The cosine similarity given by
\cos \phi = \frac{x^T y}{||x||_2 \cdot ||y||_2}
If normalized=TRUE, the denominator is omitted. If convert=TRUE (the default), the cosine similarity is converted to the angular distance \phi, given in degrees ranging from 0 to 180.
jaccard
The generalized Jaccard coefficient given by
J(x, y) = \frac{ \sum_i \min(x_i, y_i) }{ \sum_i \max(x_i, y_i) }
which is only defined for non-negative vectors x and y. If convert=TRUE (the default), the Jaccard metric 1 - J(x,y) is returned (see Kosub 2016 for details). Note that J(0, 0) = 1.
overlap
An asymmetric measure of overlap given by
o(x, y) = \frac{ \sum_i \min(x_i, y_i) }{ \sum_i x_i }
for non-negative vectors x and y. If convert=TRUE (the default), the result is converted into the dissimilarity measure 1 - o(x,y), which is not a metric, of course. Note that o(0, y) = 1 and in particular o(0, 0) = 1.
Overlap computes the proportion of the “mass” of x that is shared with y; as a consequence, o(x, y) = 1 whenever x \le y. If both vectors are normalized as probability distributions (||x||_1 = ||y||_1 = 1), overlap is symmetric (o(x, y) = o(y, x)) and can be thought of as the shared probability mass of the two distributions. In this case, normalized=TRUE can be passed in order to simplify the computation to o(x, y) = \sum_i \min(x_i, y_i).
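The three similarity measures above can be traced by hand in base R (base R has no degree-valued arccosine, so the angular distance is obtained as acos(...) / pi * 180):

```r
x <- c(4, 0, 1)
y <- c(1, 1, 1)

# cosine similarity and angular distance in degrees
cos_sim <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))  # 5 / sqrt(51)
angle   <- acos(pmin(pmax(cos_sim, -1), 1)) / pi * 180     # clamp guards rounding

# generalized Jaccard coefficient and the metric 1 - J
J <- sum(pmin(x, y)) / sum(pmax(x, y))  # 2/6
jaccard_dist <- 1 - J

# overlap is asymmetric: o(x, y) != o(y, x) in general
o_xy <- sum(pmin(x, y)) / sum(x)  # 2/5
o_yx <- sum(pmin(x, y)) / sum(y)  # 2/3
```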
Author(s)
Stephanie Evert (https://purl.org/stephanie.evert)
See Also
plot and head methods for distance matrices; nearest.neighbours and pair.distances also accept a precomputed dist.matrix object instead of a DSM matrix M
rowNorms for length normalization of DSM vectors, which is highly recommended for most distance metrics (and implicit in cosine)
Examples
M <- DSM_TermTermMatrix
dist.matrix(M, as.dist=TRUE) # angular distance
dist.matrix(M, method="euclidean", as.dist=TRUE) # Euclidean distance
dist.matrix(M, method="manhattan", as.dist=TRUE) # Manhattan distance
dist.matrix(M, method="minkowski", p=1, as.dist=TRUE) # L_1 distance
dist.matrix(M, method="minkowski", p=99, as.dist=TRUE) # almost L_Inf
dist.matrix(M, method="maximum", as.dist=TRUE) # L_Inf (maximum)
dist.matrix(M, method="minkowski", p=.5, as.dist=TRUE) # L_0.5 distance
dist.matrix(M, method="minkowski", p=0, as.dist=TRUE) # Hamming distance
round(dist.matrix(M, method="cosine", convert=FALSE), 3) # cosine similarity