divide_conquer_mds {bigmds}R Documentation

Divide-and-conquer MDS

Description

Roughly speaking, a large data set, x, of size n is divided into parts, then classical MDS is performed over every part and, finally, the partial configurations are combined so that all the points lie on the same coordinate system with the aim to obtain a global MDS configuration.

Usage

divide_conquer_mds(x, l, c_points, r, n_cores = 1, dist_fn = stats::dist, ...)

Arguments

x

A matrix with n points (rows) and k variables (columns).

l

The size for which classical MDS can be computed efficiently (using cmdscale function). It means that if \bar{l} is the limit size for which classical MDS is applicable, then l\leq \bar{l}.

c_points

Number of points used to align the MDS solutions obtained by the division of x into p data subsets. Recommended value: 2·r.

r

Number of principal coordinates to be extracted.

n_cores

Number of cores wanted to use to run the algorithm.

dist_fn

Distance function used to compute the distance between the rows.

...

Further arguments passed to dist_fn function.

Details

The divide-and-conquer MDS starts dividing the n points into p partitions: the first partition contains l points and the others contain l-c_points points. Therefore, p = 1 + (n-l)/(l-c_points). The partitions are created at random.

Once the partitions are created, c_points different random points are taken from the first partition and concatenated to the other partitions After that, classical MDS is applied to each partition, with target low dimensional configuration r.

Since all the partitions share c_points points with the first one, Procrustes can be applied in order to align all the configurations. Finally, all the configurations are concatenated in order to obtain a global MDS configuration.

Value

Returns a list containing the following elements:

points

A matrix that consists of n points (rows) and r variables (columns) corresponding to the principal coordinates. Since a dimensionality reduction is performed, r<<k

eigen

The first r largest eigenvalues: \bar{\lambda}_i, i \in \{1, \dots, r\} , where \bar{\lambda}_i = 1/p \sum_{j=1}^{p}\lambda_i^j/n_j, being \lambda_i^j the i-th eigenvalue from partition j and n_j the size of the partition j.

GOF

A numeric vector of length 2.

The first element corresponds to 1/n \sum_{j=1}^{p}n_jG_1^j, where G_1^j = \sum_{i = 1}^{r} \lambda_{i}^{j}/ \sum_{i = 1}^{n-1} |\lambda_{i}^{j}|.

The second element corresponds to 1/n \sum_{j=1}^{p}n_jG_2^j where G_2^j = \sum_{i = 1}^{r} \lambda_{i}^{j}/ \sum_{i = 1}^{n-1} max(\lambda_{i}^{j}, 0).

References

Delicado P. and C. Pachón-García (2021). Multidimensional Scaling for Big Data. https://arxiv.org/abs/2007.11919.

Borg, I. and P. Groenen (2005). Modern Multidimensional Scaling: Theory and Applications. Springer.

Examples

set.seed(42)
x <- matrix(data = rnorm(4*10000), nrow = 10000) %*% diag(c(9, 4, 1, 1))
mds <- divide_conquer_mds(x = x, l = 200, c_points = 2*2, r = 2, n_cores = 1, dist_fn = stats::dist)
head(mds$points)
mds$eigen
mds$GOF
points <- mds$points
plot(x[1:10, 1],
     x[1:10, 2],
     xlim = range(c(x[1:10,1],points[1:10,1])),
     ylim = range(c(x[1:10,2], points[1:10,2])),
     pch = 19,
     col = "green")
text(x[1:10, 1], x[1:10, 2], labels=1:10)
points(points[1:10, 1], points[1:10, 2], pch = 19, col = "orange")
text(points[1:10, 1], points[1:10, 2], labels=1:10)
abline(v = 0, lwd=3, lty=2)
abline(h = 0, lwd=3, lty=2)


[Package bigmds version 2.0.1 Index]