R: Methods for Computing Similarity Matrices

similarities {apcluster}

R Documentation

Methods for Computing Similarity Matrices

Description

Compute similarity matrices from data set

Usage

negDistMat(x, sel=NA, r=1, method="euclidean", p=2)
expSimMat(x, sel=NA, r=2, w=1, method="euclidean", p=2)
linSimMat(x, sel=NA, w=1, method="euclidean", p=2)
corSimMat(x, sel=NA, r=1, signed=TRUE, method="pearson")
linKernel(x, sel=NA, normalize=FALSE)

Arguments

`x`	input data to be clustered; if `x` is a vector, it is interpreted as a list of scalar values that are to be clustered; if `x` is a matrix or data frame, rows are interpreted as samples and columns are interpreted as features; in the case that `x` is a data frame, only numerical columns/features are taken into account, whereas categorical features are neglected. If `x` is missing, all functions return a function that can be used as similarity measure, in particular, as `s` argument for `apclusterL`.
`sel`	selected samples subset; vector of row indices for x in increasing order (see details below)
`r`	exponent (see details below)
`w`	radius (see details below)
`signed`	take sign of correlation into account (see details below)
`normalize`	see details below
`method`	type of distance measure to be used; for `negDistMat`, `expSimMat` and `linSimMat`, this argument is analogous to the `method` argument of `dist`. For `corSimMat`, this argument is analogous to the `method` argument of `cor`.
`p`	exponent for Minkowski distance; only used for `method="minkowski"`, otherwise ignored. See `dist`.

Details

negDistMat creates a square matrix of mutual pairwise similarities of data vectors as negative distances. The argument r (default is 1) is used to transform the resulting distances by computing the r-th power (use r=2 to obtain negative squared distances as in Frey's and Dueck's demos), i.e., given a distance d, the resulting similarity is computed as s=-d^r. With the parameter sel a subset of samples can be specified for distance calculation. In this case not the full distance matrix is computed but a rectangular similarity matrix of all samples (rows) against the subset (cols) as needed for leveraged clustering. Internally, the computation of distances is done using an internal method derived from dist. All options of this function except diag and upper can be used, especially method which allows for selecting different distance measures. Note that, since version 1.4.4. of the package, there is an additional method "discrepancy" that implements Weyl's discrepancy measure.

expSimMat computes similarities in a way similar to negDistMat, but the transformation of distances to similarities is done in the following way:

s=\exp\left(-\left(\frac{d}{w}\right)^r\right)

The parameter sel allows the creation of a rectangular similarity matrix. As above, r is an exponent. The parameter w controls the speed of descent. r=2 in conjunction with Euclidean distances corresponds to the well-known Gaussian/RBF kernel, whereas r=1 corresponds to the Laplace kernel. Note that these similarity measures can also be understood as fuzzy equality relations.

linSimMat provides another way of transforming distances into similarities by applying the following transformation to a distance d:

s=\max\left(0,1-\frac{d}{w}\right)

Thw parameter sel is used again for creation of a rectangular similarity matrix. Here w corresponds to a maximal radius of interest. Note that this is a fuzzy equality relation with respect to the Lukasiewicz t-norm.

Unlike the above three functions, linKernel computes pairwise similarities as scalar products of data vectors, i.e. it corresponds, as the name suggests, to the “linear kernel”. Use parameter sel to compute only a submatrix of the full kernel matrix as described above. If normalize=TRUE, the values are scaled to the unit sphere in the following way (for two samples x and y:

s=\frac{\vec{x}^T\vec{y}}{\|\vec{x}\| \|\vec{y}\|}

The function corSimMat computes pairwise similarities as correlations. It uses link[stats:cor]{cor} internally. The method argument is passed on to link[stats:cor]{cor}. The argument r serves as an exponent with which the correlations can be transformed. If signed=TRUE (default), negative correlations are taken into account, i.e. two samples are maximally dissimilar if they are negatively correlated. If signed=FALSE, similarities are computed as absolute values of correlations, i.e. two samples are maximally similar if they are positively or negatively correlated and the two samples are maximally dissimilar if they are uncorrelated.

Note that the naming of the argument p has been chosen for consistency with dist and previous versions of the package. When using leveraged AP in conjunction with the Minkowski distance, this leads to conflicts with the input preference parameter p of apclusterL. In order to avoid that, use the above functions without x argument to create a custom similarity measure with fixed parameter p (see example below).

Value

All functions listed above return square or rectangular matrices of similarities.

Author(s)

Ulrich Bodenhofer, Andreas Kothmeier, and Johannes Palme

References

https://github.com/UBod/apcluster

Bodenhofer, U., Kothmeier, A., and Hochreiter, S. (2011) APCluster: an R package for affinity propagation clustering. Bioinformatics 27, 2463-2464. DOI: doi:10.1093/bioinformatics/btr406.

Frey, B. J. and Dueck, D. (2007) Clustering by passing messages between data points. Science 315, 972-976. DOI: doi:10.1126/science.1136800.

Micchelli, C. A. (1986) Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constr. Approx. 2, 11-20.

De Baets, B. and Mesiar, R. (1997) Pseudo-metrics and T-equivalences. J. Fuzzy Math. 5, 471-481.

Bauer, P., Bodenhofer, U., and Klement, E. P. (1996) A fuzzy algorithm for pixel classification based on the discrepancy norm. In Proc. 5th IEEE Int. Conf. on Fuzzy Systems, volume III, pages 2007–2012, New Orleans, LA. DOI: doi:10.1109/FUZZY.1996.552744.

Examples

## create two Gaussian clouds
cl1 <- cbind(rnorm(100, 0.2, 0.05), rnorm(100, 0.8, 0.06))
cl2 <- cbind(rnorm(100, 0.7, 0.08), rnorm(100, 0.3, 0.05))
x <- rbind(cl1, cl2)

## create negative distance matrix (default Euclidean)
sim1 <- negDistMat(x)

## compute similarities as squared negative distances
## (in accordance with Frey's and Dueck's demos)
sim2 <- negDistMat(x, r=2)

## compute RBF kernel
sim3 <- expSimMat(x, r=2)

## compute similarities as squared negative distances
## all samples versus a randomly chosen subset 
## of 50 samples (for leveraged AP clustering)
sel <- sort(sample(1:nrow(x), nrow(x)*0.25)) 
sim4 <- negDistMat(x, sel, r=2)


## example of leveraged AP using Minkowski distance with non-default
## parameter p
cl1 <- cbind(rnorm(150, 0.2, 0.05), rnorm(150, 0.8, 0.06))
cl2 <- cbind(rnorm(100, 0.7, 0.08), rnorm(100, 0.3, 0.05))
x <- rbind(cl1, cl2)

apres <- apclusterL(s=negDistMat(method="minkowski", p=2.5, r=2),
                       x, frac=0.2, sweeps=3, p=-0.2)
show(apres)

[Package apcluster version 1.4.13 Index]