distances {distances}    R Documentation

Constructor for distance metric objects

Description

distances constructs a distance metric for a set of points. Currently, it only creates Euclidean distances. It can, however, create distances in any linear projection of Euclidean space, so both Mahalanobis distances and normalized Euclidean distances are possible. It is also possible to give each dimension of the space a different weight.

Usage

distances(
  data,
  id_variable = NULL,
  dist_variables = NULL,
  normalize = NULL,
  weights = NULL
)

Arguments

data

a matrix or data frame containing the data points between which distances should be derived.

id_variable

optional IDs of the data points. If id_variable is a single string and data is a data frame, the corresponding column in data will be taken as IDs. That column will be excluded from data when constructing distances (unless it is listed in dist_variables). If id_variable is NULL, the IDs are set to 1:nrow(data). Otherwise, id_variable must be of length nrow(data) and will be used directly as IDs.

dist_variables

optional names of the columns in data that should be used when constructing distances. If dist_variables is NULL, all columns will be used (excluding the column specified by id_variable, if any). If data is a matrix, dist_variables must be NULL.

normalize

optional normalization of the data prior to distance construction. If normalize is NULL or "none", no normalization will be done (effectively setting normalize to the identity matrix). If normalize is "mahalanobize", normalization will be done with var(data) (i.e., resulting in Mahalanobis distances). If normalize is "studentize", normalization is done with the diagonal of var(data). If normalize is a matrix, it will be used in the normalization. If normalize is a vector, a diagonal matrix with the supplied vector as its diagonal will be used. The matrix used for normalization must be positive-semidefinite.

weights

optional weighting of the data prior to distance construction. If weights is NULL, no weighting will be done (effectively setting weights to the identity matrix). If weights is a matrix, that matrix will be used in the weighting. If weights is a vector, a diagonal matrix with the supplied vector as its diagonal will be used. The matrix used for weighting must be positive-semidefinite.
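
The matrices described for normalize and weights can be built and inspected in base R before they are passed in. The following is a minimal sketch, not the package's own validation code: is_psd is a hypothetical helper introduced here for illustration, and the tolerance is an arbitrary assumption.

```r
# Example data with the same shape as the points in the Examples section.
pts <- cbind(x = 1:10, y = c(10, 9, 8, 7, 6, 6, 7, 8, 9, 10))

# Matrices corresponding to the documented options.
norm_mahalanobize <- var(pts)              # full covariance matrix
norm_studentize   <- diag(diag(var(pts)))  # column variances on the diagonal only
weights_from_vec  <- diag(c(2, 1))         # a vector becomes a diagonal matrix

# Hypothetical helper (not part of the package): TRUE if `m` is symmetric
# and has no negative eigenvalues, up to a small numerical tolerance.
is_psd <- function(m, tol = 1e-8) {
  isSymmetric(unname(m)) &&
    all(eigen(m, symmetric = TRUE, only.values = TRUE)$values >= -tol)
}

is_psd(matrix(c(3, 1, 1, 3), nrow = 2))  # eigenvalues 4 and 2
is_psd(matrix(c(1, 3, 3, 1), nrow = 2))  # eigenvalues 4 and -2, so not valid
```

Checking a custom matrix this way before calling distances makes it easier to tell a malformed input apart from other problems.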

Details

Let x and y be two data points in data described by two vectors. distances uses the following metric to derive the distance between x and y:

\sqrt{(x - y) N^{-0.5} W (N^{-0.5})' (x - y)}

where N^{-0.5} is the Cholesky decomposition (lower triangular) of the inverse of the matrix specified by normalize, and W is the matrix specified by weights.
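
The metric can be transcribed into base R for a single pair of points. This is a sketch of the formula as written in this section, not the package's implementation. The function name metric_sketch and its argument names are assumptions made for illustration.

```r
# Hypothetical transcription of the metric for two points x and y.
# N is the normalization matrix and W the weight matrix; both default
# to the identity, which gives the ordinary Euclidean distance.
metric_sketch <- function(x, y, N = diag(length(x)), W = diag(length(x))) {
  d <- x - y
  # chol() returns an upper-triangular factor, so transpose it to get the
  # lower-triangular L with L %*% t(L) equal to solve(N).
  L <- t(chol(solve(N)))
  as.numeric(sqrt(t(d) %*% L %*% W %*% t(L) %*% d))
}

pts <- cbind(x = 1:10, y = c(10, 9, 8, 7, 6, 6, 7, 8, 9, 10))

# With identity N and W the result matches dist().
euclid <- metric_sketch(pts[1, ], pts[2, ])
all.equal(euclid, as.matrix(dist(pts))[1, 2])

# With N = var(data) and identity W the result matches mahalanobis().
mahal <- metric_sketch(pts[1, ], pts[2, ], N = var(pts))
all.equal(mahal, sqrt(mahalanobis(pts[1, ], pts[2, ], var(pts))))
```

Because L %*% t(L) equals the inverse of N, leaving W as the identity reduces the expression under the square root to the usual Mahalanobis quadratic form.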

When normalize is var(data) (i.e., using the "mahalanobize" option), the function gives (weighted) Mahalanobis distances. When normalize is diag(var(data)) (i.e., using the "studentize" option), the function divides each column by its standard deviation (the square root of its variance), leading to (weighted) normalized Euclidean distances. If normalize is the identity matrix (i.e., using the "none" or NULL option), the function derives ordinary Euclidean distances.
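
With normalize set to diag(var(data)) and no weights, the metric in this section reduces to sqrt(sum((x - y)^2 / v)), where v holds the column variances. The base-R sketch below, which does not use the package's internals, checks that this equals the ordinary Euclidean distance after each column has been divided by its standard deviation. The variable names are chosen for illustration only.

```r
pts <- cbind(x = 1:10, y = c(10, 9, 8, 7, 6, 6, 7, 8, 9, 10))

# Route 1: plug N = diag(var(data)) and W = identity into the metric.
col_vars <- diag(var(pts))
d <- pts[1, ] - pts[2, ]
by_formula <- sqrt(sum(d^2 / col_vars))

# Route 2: rescale each column by its standard deviation, then use dist().
scaled <- sweep(pts, 2, sqrt(col_vars), "/")
by_scaling <- as.matrix(dist(scaled))[1, 2]

all.equal(by_formula, by_scaling)
```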

Value

Returns a distances object.

Examples

my_data_points <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                             y = c(10, 9, 8, 7, 6, 6, 7, 8, 9, 10))

# Euclidean distances
my_distances1 <- distances(my_data_points)

# Euclidean distances in only one dimension
my_distances2 <- distances(my_data_points,
                           dist_variables = "x")

# Mahalanobis distances
my_distances3 <- distances(my_data_points,
                           normalize = "mahalanobize")

# Custom normalization matrix
my_norm_mat <- matrix(c(3, 1, 1, 3), nrow = 2)
my_distances4 <- distances(my_data_points,
                           normalize = my_norm_mat)

# Give "x" twice the weight compared to "y"
my_distances5 <- distances(my_data_points,
                           weights = c(2, 1))

# Use normalization and weighting
my_distances6 <- distances(my_data_points,
                           normalize = "mahalanobize",
                           weights = c(2, 1))

# Custom ID labels
my_data_points_withID <- data.frame(my_data_points,
                                    my_ids = letters[1:10])
my_distances7 <- distances(my_data_points_withID,
                           id_variable = "my_ids")



# Compare to standard R functions

all.equal(as.matrix(my_distances1), as.matrix(dist(my_data_points)))
# > TRUE

all.equal(as.matrix(my_distances2), as.matrix(dist(my_data_points[, "x"])))
# > TRUE

tmp_distances <- sqrt(mahalanobis(as.matrix(my_data_points),
                                  unlist(my_data_points[1, ]),
                                  var(my_data_points)))
names(tmp_distances) <- 1:10
all.equal(as.matrix(my_distances3)[1, ], tmp_distances)
# > TRUE

tmp_data_points <- as.matrix(my_data_points)
tmp_data_points[, 1] <- sqrt(2) * tmp_data_points[, 1]
all.equal(as.matrix(my_distances5), as.matrix(dist(tmp_data_points)))
# > TRUE

tmp_data_points <- as.matrix(my_data_points)
tmp_cov_mat <- var(tmp_data_points)
tmp_data_points[, 1] <- sqrt(2) * tmp_data_points[, 1]
tmp_distances <- sqrt(mahalanobis(tmp_data_points,
                                  tmp_data_points[1, ],
                                  tmp_cov_mat))
names(tmp_distances) <- 1:10
all.equal(as.matrix(my_distances6)[1, ], tmp_distances)
# > TRUE

tmp_distances <- as.matrix(dist(my_data_points))
colnames(tmp_distances) <- rownames(tmp_distances) <- letters[1:10]
all.equal(as.matrix(my_distances7), tmp_distances)
# > TRUE


[Package distances version 0.1.10 Index]