dist.matrix {wordspace} | R Documentation |
Distances/Similarities between Row or Column Vectors (wordspace)
Description
Compute a symmetric matrix of distances (or similarities) between the rows or columns of a matrix, or compute cross-distances between the rows or columns of two different matrices.
This implementation is faster than dist and can operate on sparse matrices (in canonical DSM format).
Usage
dist.matrix(M, M2 = NULL, method = "cosine", p = 2,
normalized = FALSE, byrow = TRUE, convert = TRUE, as.dist = FALSE,
terms = NULL, terms2 = terms, skip.missing = FALSE)
Arguments
M |
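a dense or sparse matrix representing a scored DSM, or an object of class dsm |
M2 |
an optional dense or sparse matrix representing a second scored DSM, or an object of class dsm. If present, cross-distances between the rows (or columns) of M and those of M2 are computed |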
method |
distance or similarity measure to be used (see “Distance Measures” below for details) |
p |
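exponent of the minkowski L_p-metric, a numeric value in the range from 0 to Inf (see “Distance Measures” below for details) |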
normalized |
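if TRUE, assume that the row (or column) vectors have already been normalized appropriately for the selected measure, which simplifies the computation (see “Distance Measures” below for details) |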
byrow |
whether to calculate distances between row vectors (default) or between column vectors (byrow=FALSE) |
convert |
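if TRUE, similarity measures are automatically converted to distances or dissimilarities (see “Distance Measures” below for details) |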
as.dist |
convert the full symmetric distance matrix to a compact object of class dist |
terms |
a character vector specifying rows of M (or columns if byrow=FALSE) for which the distance matrix is to be computed |
terms2 |
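a character vector specifying rows of M2 (or columns if byrow=FALSE) for which the cross-distance matrix is to be computed; defaults to terms |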
skip.missing |
if TRUE, silently skip terms that are not found in M (or in M2) |
Value
By default, a numeric matrix of class dist.matrix, specifying distances or similarities between term vectors. A similarity matrix is marked by an additional attribute similarity with value TRUE.
If the distance or similarity matrix is symmetric (i.e. neither a cross-distance matrix nor based on an asymmetric distance measure), it is marked by an attribute symmetric with value TRUE.
If as.dist=TRUE, the matrix is compacted to an object of class dist.
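To illustrate what the compact dist class looks like, the following minimal sketch uses only base R (dist, as.matrix and as.dist from the stats package). It does not call dist.matrix itself; the toy matrix X and its row labels are made up for illustration.

```r
# Toy data (hypothetical): three labelled points in the plane.
X <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

D <- as.matrix(dist(X))  # full 3 x 3 symmetric matrix with zero diagonal
d <- as.dist(D)          # compact object of class "dist" (lower triangle only)

class(d)                 # "dist"
length(d)                # n * (n - 1) / 2 = 3 stored distances for n = 3
isSymmetric(D)           # TRUE
attr(d, "Labels")        # row labels are kept as an attribute
```

The compact form stores each pairwise distance once, which is why it is only meaningful for a symmetric distance matrix.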
Distance Measures
Given two DSM vectors $x$ and $y$, the following distance metrics can be computed:
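euclidean
The Euclidean distance given by
$$ d_2(x, y) = \sqrt{ \sum_i (x_i - y_i)^2 } $$
manhattan
The Manhattan (or “city block”) distance given by
$$ d_1(x, y) = \sum_i |x_i - y_i| $$
maximum
The maximum distance given by
$$ d_\infty(x, y) = \max_i |x_i - y_i| $$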
minkowski
The Minkowski distance is a family of metrics determined by a parameter $0 \le p < \infty$, which encompasses the Euclidean, Manhattan and maximum distance as special cases. Also known as $L_p$-metric, it is defined by
$$ d_p(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p} $$
for $p \ge 1$ and by
$$ d_p(x, y) = \sum_i |x_i - y_i|^p $$
for $0 \le p < 1$. In the latter case, it is not homogeneous and cannot be derived from a corresponding mathematical norm (cf. rowNorms).
Special cases include the Euclidean metric $d_2(x, y)$ for $p = 2$ and the Manhattan metric $d_1(x, y)$ for $p = 1$, but the dedicated methods above provide more efficient implementations. For $p \to \infty$, $d_p(x, y)$ converges to the maximum distance $d_\infty(x, y)$, which is also selected by setting p=Inf. For $p = 0$, $d_p(x, y)$ corresponds to the Hamming distance, i.e. the number of differences
$$ d_0(x, y) = \#\{ i \mid x_i \ne y_i \} $$
canberra
The Canberra metric has been implemented for compatibility with the dist function, even though it is probably not very useful for DSM vectors. It is given by
$$ \sum_i \frac{|x_i - y_i|}{|x_i| + |y_i|} $$
(see https://en.wikipedia.org/wiki/Canberra_distance). Terms with $x_i = y_i = 0$ are silently dropped from the summation.
Note that dist uses a different formula
$$ \sum_i \frac{|x_i - y_i|}{|x_i + y_i|} $$
which is highly problematic unless $x$ and $y$ are guaranteed to be non-negative. Terms with $x_i = y_i = 0$ are imputed, i.e. set to the average value of all nonzero terms.
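The distance metrics above can be sketched in a few lines of base R for a single pair of vectors. This is an illustration of the definitions only, not the package's implementation; the helper names minkowski_sketch and canberra_sketch and the example vectors are hypothetical.

```r
# Minkowski family for two numeric vectors (hypothetical helper, not part of wordspace).
minkowski_sketch <- function(x, y, p = 2) {
  stopifnot(length(x) == length(y), p >= 0)
  d <- abs(x - y)
  if (is.infinite(p)) return(max(d))  # maximum distance (p = Inf)
  if (p == 0) return(sum(d != 0))     # Hamming distance: number of differing coordinates
  if (p < 1) return(sum(d^p))         # 0 < p < 1: no root is taken, so not homogeneous
  sum(d^p)^(1 / p)                    # p >= 1: standard L_p metric
}

# Canberra distance with |x_i| + |y_i| in the denominator; 0/0 terms are dropped.
canberra_sketch <- function(x, y) {
  num <- abs(x - y)
  den <- abs(x) + abs(y)
  keep <- den > 0
  sum(num[keep] / den[keep])
}

x <- c(1, 0, 3, 0)
y <- c(4, 0, -1, 2)

minkowski_sketch(x, y, p = 1)    # Manhattan: 3 + 0 + 4 + 2 = 9
minkowski_sketch(x, y, p = 2)    # Euclidean: sqrt(9 + 16 + 4) = sqrt(29)
minkowski_sketch(x, y, p = Inf)  # maximum: 4
minkowski_sketch(x, y, p = 0)    # Hamming: 3 coordinates differ
canberra_sketch(x, y)            # 3/5 + 4/4 + 2/2 = 2.6

# Scaling both vectors by 2 scales the p = 0.5 distance by 2^0.5, not by 2.
minkowski_sketch(2 * x, 2 * y, p = 0.5) / minkowski_sketch(x, y, p = 0.5)
```

The last line shows why the case p < 1 is not homogeneous: doubling both vectors multiplies the distance by sqrt(2) rather than by 2.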
In addition, the following similarity measures can be computed and optionally converted to a distance metric (or dissimilarity):
cosine (default)
The cosine similarity given by
$$ \cos \phi = \frac{x^T y}{\|x\|_2 \cdot \|y\|_2} $$
If normalized=TRUE, the denominator is omitted. If convert=TRUE (the default), the cosine similarity is converted to angular distance $\phi$, given in degrees ranging from 0 to 180.
jaccard
The generalized Jaccard coefficient given by
$$ J(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)} $$
which is only defined for non-negative vectors $x$ and $y$. If convert=TRUE (the default), the Jaccard metric $1 - J(x, y)$ is returned (see Kosub 2016 for details). Note that $J(0, 0) = 1$.
overlap
An asymmetric measure of overlap given by
$$ o(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i x_i} $$
for non-negative vectors $x$ and $y$. If convert=TRUE (the default), the result is converted into a dissimilarity measure $1 - o(x, y)$, which is not a metric, of course. Note that $o(0, y) = 1$ and in particular $o(0, 0) = 1$.
Overlap computes the proportion of the “mass” of $x$ that is shared with $y$; as a consequence, $o(x, y) = 1$ whenever $x \le y$. If both vectors are normalized as probability distributions ($\|x\|_1 = \|y\|_1 = 1$) then overlap is symmetric ($o(x, y) = o(y, x)$) and can be thought of as the shared probability mass of the two distributions. In this case, normalized=TRUE can be passed in order to simplify the computation to $o(x, y) = \sum_i \min(x_i, y_i)$.
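The three similarity measures and their conversions can be sketched in base R for a single pair of vectors. This is a hedged illustration of the definitions, not the package's implementation: the helper names are hypothetical, and the value 1 returned for all-zero input is an assumption made explicit in the comments.

```r
# Cosine similarity, optionally converted to angular distance in degrees.
cosine_sketch <- function(x, y, convert = TRUE) {
  cs <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
  cs <- max(-1, min(1, cs))  # guard against rounding slightly outside [-1, 1]
  if (convert) acos(cs) * 180 / pi else cs
}

# Generalized Jaccard coefficient for non-negative vectors.
jaccard_sketch <- function(x, y, convert = TRUE) {
  stopifnot(all(x >= 0), all(y >= 0))
  den <- sum(pmax(x, y))
  J <- if (den == 0) 1 else sum(pmin(x, y)) / den  # assumed convention: J(0, 0) = 1
  if (convert) 1 - J else J
}

# Asymmetric overlap: share of the mass of x that is also present in y.
overlap_sketch <- function(x, y, convert = TRUE) {
  stopifnot(all(x >= 0), all(y >= 0))
  mass <- sum(x)
  o <- if (mass == 0) 1 else sum(pmin(x, y)) / mass  # assumed convention: o(0, y) = 1
  if (convert) 1 - o else o
}

x <- c(1, 2, 0)
y <- c(2, 2, 1)

cosine_sketch(x, y, convert = FALSE)   # 6 / (sqrt(5) * 3)
cosine_sketch(c(1, 0), c(0, 1))        # orthogonal vectors: 90 degrees
jaccard_sketch(x, y, convert = FALSE)  # (1 + 2 + 0) / (2 + 2 + 1) = 0.6
overlap_sketch(x, y, convert = FALSE)  # x <= y in every coordinate, so 1
overlap_sketch(y, x, convert = FALSE)  # 3 / 5 = 0.6, showing the asymmetry
```

Swapping the arguments of overlap_sketch changes the result because the denominator only sums the first vector.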
Author(s)
Stephanie Evert (https://purl.org/stephanie.evert)
See Also
plot and head methods for distance matrices; nearest.neighbours and pair.distances also accept a precomputed dist.matrix object instead of a DSM matrix M
rowNorms for length normalization of DSM vectors, which is highly recommended for most distance metrics (and implicit in cosine)
Examples
M <- DSM_TermTermMatrix
dist.matrix(M, as.dist=TRUE) # angular distance
dist.matrix(M, method="euclidean", as.dist=TRUE) # Euclidean distance
dist.matrix(M, method="manhattan", as.dist=TRUE) # Manhattan distance
dist.matrix(M, method="minkowski", p=1, as.dist=TRUE) # L_1 distance
dist.matrix(M, method="minkowski", p=99, as.dist=TRUE) # almost L_Inf
dist.matrix(M, method="maximum", as.dist=TRUE) # L_Inf (maximum)
dist.matrix(M, method="minkowski", p=.5, as.dist=TRUE) # L_0.5 distance
dist.matrix(M, method="minkowski", p=0, as.dist=TRUE) # Hamming distance
round(dist.matrix(M, method="cosine", convert=FALSE), 3) # cosine similarity