distance {analogue}  R Documentation 
Flexibly calculates distance or dissimilarity measures between a
training set x and a fossil or test set y. If y is not supplied then
the pairwise dissimilarities between samples in the training set, x,
are calculated.
distance(x, ...)
## Default S3 method:
distance(x, y, method = "euclidean", weights = NULL,
R = NULL, dist = FALSE, double.zero = FALSE, ...)
## S3 method for class 'join'
distance(x, ...)
oldDistance(x, ...)
## Default S3 method:
oldDistance(x, y, method = c("euclidean", "SQeuclidean",
"chord", "SQchord", "bray", "chi.square",
"SQchi.square", "information", "chi.distance",
"manhattan", "kendall", "gower", "alt.gower",
"mixed"),
fast = TRUE,
weights = NULL, R = NULL, ...)
## S3 method for class 'join'
oldDistance(x, ...)
x 
data frame or matrix containing the training set samples, or
an object of class "join". 
y 
data frame or matrix containing the fossil or test set samples. 
method 
character; which choice of dissimilarity coefficient to use. One of the listed options. See Details below. 
weights 
numeric; vector of weights for each descriptor. 
R 
numeric; vector of ranges for each descriptor. 
dist 
logical; should the dissimilarity matrix be returned as
an object of class "dist"? 
double.zero 
logical; if 
fast 
logical; should fast versions of the dissimilarities be calculated? See details below. 
... 
arguments passed to other methods 
A range of dissimilarity coefficients can be used to calculate dissimilarity between samples. The following are currently available:
euclidean
 d_{jk} = \sqrt{\sum_i (x_{ij} - x_{ik})^2}

SQeuclidean
 d_{jk} = \sum_i (x_{ij} - x_{ik})^2

chord
 d_{jk} = \sqrt{\sum_i (\sqrt{x_{ij}} - \sqrt{x_{ik}})^2}

SQchord
 d_{jk} = \sum_i (\sqrt{x_{ij}} - \sqrt{x_{ik}})^2

bray
 d_{jk} = \frac{\sum_i |x_{ij} - x_{ik}|}{\sum_i (x_{ij} + x_{ik})}

chi.square
 d_{jk} = \sqrt{\sum_i \frac{(x_{ij} - x_{ik})^2}{x_{ij} + x_{ik}}}

SQchi.square
 d_{jk} = \sum_i \frac{(x_{ij} - x_{ik})^2}{x_{ij} + x_{ik}}

information
 d_{jk} = \sum_i (p_{ij}\log(\frac{2p_{ij}}{p_{ij} + p_{ik}}) + p_{ik}\log(\frac{2p_{ik}}{p_{ij} + p_{ik}}))

chi.distance
 d_{jk} = \sqrt{\sum_i (x_{ij} - x_{ik})^2 / (x_{i+} / x_{++})}

manhattan
 d_{jk} = \sum_i (|x_{ij} - x_{ik}|)

kendall
 d_{jk} = \sum_i MAX_i - minimum(x_{ij}, x_{ik})

gower
 d_{jk} = \sum_i\frac{|p_{ij} - p_{ik}|}{R_i}

alt.gower
 d_{jk} = \sqrt{2\sum_i\frac{|p_{ij} - p_{ik}|}{R_i}}

where R_i is the range of proportions for descriptor (variable) i.


mixed
 d_{jk} = \frac{\sum_{i=1}^p w_{i}s_{jki}}{\sum_{i=1}^p w_{i}}

where w_i is the weight for descriptor i and s_{jki} is the similarity
between samples j and k for descriptor (variable) i.

metric.mixed
 as for mixed but with ordinal variables converted to ranks and
 handled as quantitative variables in Gower's mixed coefficient.
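
As a rough illustration of the mixed formula, the following base R sketch evaluates the weighted mean of per-descriptor similarities for two samples described by quantitative variables only. It assumes s_{jki} = 1 - |x_{ij} - x_{ik}| / R_i for quantitative descriptors and w_i = 0 wherever either value is missing, as in Gower's general coefficient. The helper name gower_similarity is hypothetical and is not part of analogue; the data are those of the Legendre & Legendre example used in the Examples.

```r
## Hypothetical helper (not part of analogue): Gower's general similarity
## for two samples described by quantitative variables only.
gower_similarity <- function(xj, xk, R) {
    w <- as.numeric(!(is.na(xj) | is.na(xk)))  # weight 0 if either value is missing
    s <- 1 - abs(xj - xk) / R                  # per-descriptor similarity
    s[w == 0] <- 0                             # missing comparisons contribute nothing
    sum(w * s) / sum(w)                        # weighted mean similarity
}

xj <- c(2, 2, NA, 2, 2, 4, 2, 6)
xk <- c(1, 3, 3, 1, 2, 2, 2, 5)
Rj <- c(1, 4, 2, 4, 1, 3, 2, 5)   # supplied ranges

S <- gower_similarity(xj, xk, Rj)  # similarity
D <- 1 - S                         # corresponding dissimilarity
```

With these values S is about 0.66, which is the similarity quoted for this example in the Examples.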

Argument fast determines whether fast C versions of some of the
dissimilarity coefficients are used. The fast versions make use of
dist for methods "euclidean", "SQeuclidean", "chord", "SQchord", and
vegdist for method == "bray". These fast versions are used only when
x is supplied alone, not when y is also supplied. Future versions of
distance will include fast C versions of all the dissimilarity
coefficients and for cases where y is supplied.
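
The relationship that makes dist usable for the chord family can be sketched in base R: applying Euclidean dist to square-root transformed data gives the chord distance as defined in the list of coefficients. This is an illustration only, not the package's internal code; the objects x, chord_fast and chord_slow are made up for the example.

```r
set.seed(1)
x <- matrix(runif(30), ncol = 5)     # 6 samples, 5 descriptors (dummy data)

## chord distance via stats::dist on square-root transformed data
chord_fast <- as.matrix(dist(sqrt(x), method = "euclidean"))

## the same coefficient computed pair by pair from the formula
n <- nrow(x)
chord_slow <- matrix(0, n, n)
for (j in seq_len(n)) {
    for (k in seq_len(n)) {
        chord_slow[j, k] <- sqrt(sum((sqrt(x[j, ]) - sqrt(x[k, ]))^2))
    }
}

## compare the two results, ignoring the dimnames added by as.matrix()
all.equal(unname(chord_fast), chord_slow)
```

Squaring either matrix gives the corresponding "SQchord" values.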
A matrix of dissimilarities where columns are the samples in y and
the rows the samples in x. If y is not provided then a square,
symmetric matrix of pairwise sample dissimilarities for the training
set x is returned, unless argument dist is TRUE, in which case an
object of class "dist" is returned. See dist.
The dissimilarity coefficient used (method) is returned as attribute
"method". Attribute "type" indicates whether the object was computed
on a single data matrix ("symmetric") or across two matrices (i.e. the
dissimilarities between the rows of two matrices; "asymmetric").
For method = "mixed" it is essential that a factor in x and y has the
same levels in the two data frames. Previous versions of analogue
would work even if this was not the case, which will have generated
incorrect dissimilarities for method = "mixed" for cases where factors
for a given species had different levels in x and y.
distance now checks for matching levels for each species (column)
recorded as a factor. If the factor for any individual species has
different levels in x and y, an error will be issued.
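
A check of this kind can be run before calling distance. The sketch below is one possible way to list factor columns whose levels differ between two data frames. The helper mismatched_levels is hypothetical and is not the check implemented in analogue; it assumes both data frames have the same column names.

```r
## Hypothetical helper (not part of analogue): names of factor columns
## in x whose levels differ from those of the same column in y.
mismatched_levels <- function(x, y) {
    facs <- names(x)[vapply(x, is.factor, logical(1))]
    bad <- vapply(facs, function(nm) {
        !is.factor(y[[nm]]) || !identical(levels(x[[nm]]), levels(y[[nm]]))
    }, logical(1))
    facs[bad]
}

train     <- data.frame(a = runif(4), b = factor(c(1, 2, 3, 4), levels = 1:4))
fossil    <- data.frame(a = runif(2), b = factor(c(1, 2)))              # levels 1:2 only
fossil_ok <- data.frame(a = runif(2), b = factor(c(1, 2), levels = 1:4))

mismatched_levels(train, fossil)     # column "b" is flagged
mismatched_levels(train, fossil_ok)  # nothing is flagged
```

Where a mismatch is reported, rebuilding the factor with an explicit levels argument, as done for fossil_ok, is one way to align the two data frames.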
The dissimilarities are calculated in native R code. As such, other
implementations (see See Also below) will be quicker. This is done for
one main reason: it is hoped to allow a user-defined function to be
supplied as argument "method" to allow for user-extension of the
available coefficients.
The other advantage of distance over other implementations is the
simplicity of calculating only the required pairwise sample
dissimilarities between each fossil sample (y) and each training set
sample (x). To do this in other implementations, you would need to
merge the two sets of samples, calculate the full dissimilarity matrix
and then subset it to achieve similar results.
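
The merge-and-subset route described above can be sketched with base R's dist for the Euclidean case and compared with a direct computation of only the training-by-fossil pairs. The object names are made up for the illustration and the direct computation is a plain base R loop, not the code used by distance.

```r
set.seed(42)
train  <- matrix(runif(40), ncol = 4, dimnames = list(LETTERS[1:10], NULL))
fossil <- matrix(runif(12), ncol = 4, dimnames = list(letters[1:3], NULL))

## merge the two sets, compute the full matrix, then subset it
full <- as.matrix(dist(rbind(train, fossil)))             # 13 x 13
cross_merged <- full[rownames(train), rownames(fossil)]   # 10 x 3

## compute only the required training-by-fossil pairs directly
cross_direct <- outer(seq_len(nrow(train)), seq_len(nrow(fossil)),
                      Vectorize(function(j, k) {
                          sqrt(sum((train[j, ] - fossil[k, ])^2))
                      }))
dimnames(cross_direct) <- list(rownames(train), rownames(fossil))

all.equal(cross_merged, cross_direct)
```

Both results have training set samples in rows and fossil samples in columns, but the merged route computes 13 x 13 values to keep 10 x 3 of them.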
Gavin L. Simpson and Jari Oksanen (improvements leading to method
"metric.mixed" and proper handling of ordinal data via Podani's (1999)
modification of Gower's general coefficient in method "mixed").
Faith, D.P., Minchin, P.R. and Belbin, L. (1987) Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69, 57–68.
Gavin, D.G., Oswald, W.W., Wahl, E.R. and Williams, J.W. (2003) A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quaternary Research 60, 356–367.
Kendall, D.G. (1970) A mathematical approach to seriation. Philosophical Transactions of the Royal Society of London - Series B 269, 125–135.
Legendre, P. and Legendre, L. (1998) Numerical Ecology, 2nd English Edition. Elsevier Science BV, The Netherlands.
Overpeck, J.T., Webb III, T. and Prentice, I.C. (1985) Quantitative interpretation of fossil pollen spectra: dissimilarity coefficients and the method of modern analogues. Quaternary Research 23, 87–108.
Podani, J. (1999) Extending Gower's General Coefficient of Similarity to Ordinal Characters. Taxon 48, 331–340.
Prentice, I.C. (1980) Multidimensional scaling as a research tool in Quaternary palynology: a review of theory and methods. Review of Palaeobotany and Palynology 31, 71–104.
vegdist in package vegan, daisy in package cluster, and dist provide
comparable functionality for the case of missing y.
## simple example using dummy data
train <- data.frame(matrix(abs(runif(200)), ncol = 10))
rownames(train) <- LETTERS[1:20]
colnames(train) <- as.character(1:10)
fossil <- data.frame(matrix(abs(runif(100)), ncol = 10))
colnames(fossil) <- as.character(1:10)
rownames(fossil) <- letters[1:10]
## calculate distances/dissimilarities between train and fossil
## samples
test <- distance(train, fossil)
## using a different coefficient, chi-square distance
test <- distance(train, fossil, method = "chi.distance")
## calculate pairwise distances/dissimilarities for training
## set samples
test2 <- distance(train)
## Using distance on an object of class join
dists <- distance(join(train, fossil))
str(dists)
## calculate Gower's general coefficient for mixed data
## first, make a couple of variables factors
## fossil[,4] <- factor(sample(rep(1:4, length = 10), 10))
## train[,4] <- factor(sample(rep(1:4, length = 20), 20))
## ## now fit the mixed coefficient
## test3 <- distance(train, fossil, "mixed")
## Example from page 260 of Legendre & Legendre (1998)
x1 <- t(c(2,2,NA,2,2,4,2,6))
x2 <- t(c(1,3,3,1,2,2,2,5))
Rj <- c(1,4,2,4,1,3,2,5) # supplied ranges
## 1 - distance(x1, x2, method = "mixed", R = Rj)
## note this gives ~0.66 as Legendre & Legendre describe the
## coefficient as a similarity coefficient. Hence we compute
## 1 - Dij here to get the same answer.
## Tortula example from Podani (1999)
data(tortula)
Dij <- distance(tortula[, -1], method = "mixed") # col 1 includes Taxon ID
## Only one ordered factor
data(mite.env, package = "vegan")
Dij <- distance(mite.env, method = "mixed")
## Some variables are constant
data(BCI.env, package = "vegan")
Dij <- distance(BCI.env, method = "mixed")