interpolation_mds {bigmds} R Documentation

Interpolation MDS

Description

When the data set is too large for classical MDS, this algorithm takes a random sample of size l \leq \bar{l}, where \bar{l} is the limit size for which classical MDS is applicable, performs classical MDS on that sample, and extends the result to the rest of the data set using Gower's interpolation formula, which allows a new set of points to be added to an existing MDS configuration.

Usage

interpolation_mds(x, l, r, n_cores = 1, dist_fn = stats::dist, ...)


Arguments

x

A matrix with n individuals (rows) and k variables (columns).

l

The size for which classical MDS can be computed efficiently (using the cmdscale function). That is, if \bar{l} is the limit size for which classical MDS is applicable, then l \leq \bar{l}.

r

Number of principal coordinates to be extracted.

n_cores

Number of cores used to run the algorithm.

dist_fn

Distance function used to compute the distances between the rows.

...

Further arguments passed to dist_fn.

Details

Gower's interpolation formula is the central piece of this algorithm, since it allows a new set of points to be added to an existing MDS configuration so that the new points share its coordinate system.
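For orientation, one common way of writing Gower's interpolation formula is given below. The symbols are assumptions introduced here and do not appear in this help page: X_1 is the l x r configuration returned by classical MDS on the sample (column-centered, with X_1^T X_1 = \Lambda, the diagonal matrix of the r retained eigenvalues), A_{21} is the m x l matrix of squared distances between the m new points and the l sampled points, and X_2 is the m x r configuration of the new points. The package's internal implementation may differ in scaling or notation.

```latex
X_2 = \frac{1}{2}\left(\mathbf{1}_m\, g_1^{\top} - A_{21}\right) X_1 \Lambda^{-1},
\qquad
g_1 = \operatorname{diag}\left(X_1 X_1^{\top}\right)
```

When the data are Euclidean and r equals their dimension, this maps each new point exactly onto its coordinates in the system defined by X_1.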

Given the matrix x with n points (rows) and k variables (columns), a first data subset of size l is taken at random and used to compute an MDS configuration.

The remaining part of x is divided at random into p = (n - l)/l data subsets. For each data subset, an MDS configuration is obtained by means of Gower's interpolation formula and the first MDS configuration. Each new configuration is appended to the existing one so that, at the end of the process, a global MDS configuration for x is obtained.
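The steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: interpolation_sketch is a hypothetical helper, it runs on a single core, it assumes the first subset yields r positive eigenvalues, and it splits the remaining rows into batches of at most l rows.

```r
# Hypothetical helper (not part of bigmds): a single-core sketch of the
# interpolation steps described in Details.
interpolation_sketch <- function(x, l, r, dist_fn = stats::dist, ...) {
  n <- nrow(x)
  base_idx <- sample.int(n, size = l)            # first random subset
  x1 <- x[base_idx, , drop = FALSE]

  # Classical MDS on the first subset
  fit <- stats::cmdscale(dist_fn(x1, ...), k = r, eig = TRUE)
  X1 <- fit$points
  lambda <- fit$eig[seq_len(r)]                  # assumed strictly positive
  g1 <- rowSums(X1^2)                            # diagonal of X1 %*% t(X1)

  points <- matrix(NA_real_, nrow = n, ncol = r)
  points[base_idx, ] <- X1

  # Shuffle the remaining rows and cut them into batches of at most l rows
  rest <- setdiff(seq_len(n), base_idx)
  rest <- rest[sample.int(length(rest))]
  batches <- split(rest, ceiling(seq_along(rest) / l))

  for (idx in batches) {
    m <- length(idx)
    # Squared distances from the batch rows to the first subset
    # (the full (m + l) x (m + l) matrix is computed only for simplicity)
    d <- as.matrix(dist_fn(rbind(x[idx, , drop = FALSE], x1), ...))
    A21 <- d[seq_len(m), m + seq_len(l), drop = FALSE]^2
    G <- matrix(g1, nrow = m, ncol = l, byrow = TRUE)
    # Gower's interpolation for this batch
    points[idx, ] <- 0.5 * (G - A21) %*% X1 %*% diag(1 / lambda, nrow = r)
  }
  list(points = points, eigen = lambda, GOF = fit$GOF)
}

set.seed(42)
x <- matrix(rnorm(3 * 500), nrow = 500)
res <- interpolation_sketch(x, l = 50, r = 3)
# With r equal to the number of columns, pairwise distances should be
# reproduced up to numerical error.
max_err <- max(abs(as.matrix(stats::dist(res$points)) - as.matrix(stats::dist(x))))
```

Choosing r smaller than the number of columns gives an approximation instead, and the batches could be processed in parallel since each depends only on the first subset.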

Value

Returns a list containing the following elements:

points

A matrix that consists of n individuals (rows) and r variables (columns) corresponding to the principal coordinates. Since this is a dimensionality reduction, r \ll k.

eigen

The r largest eigenvalues: \lambda_i, i \in \{1, \dots, r\}, where each \lambda_i is obtained by applying classical MDS to the first data subset.

GOF

A numeric vector of length 2.

The first element corresponds to \sum_{i = 1}^{r} \lambda_{i}/ \sum_{i = 1}^{n-1} |\lambda_{i}|.

The second element corresponds to \sum_{i = 1}^{r} \lambda_{i}/ \sum_{i = 1}^{n-1} \max(\lambda_{i}, 0).
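As an illustration of the two ratios, the sketch below computes them by hand from the eigenvalues that stats::cmdscale returns for a small sample and compares them with the GOF element that cmdscale itself reports. The variable names are assumptions made for this example, and how interpolation_mds assembles its own GOF internally is not shown here. Note that cmdscale returns one eigenvalue per row, the last of which is numerically zero for Euclidean data.

```r
set.seed(1)
x1 <- matrix(rnorm(50 * 4), nrow = 50)   # stand-in for the first data subset
r <- 2

fit <- stats::cmdscale(stats::dist(x1), k = r, eig = TRUE)
ev <- fit$eig                            # all eigenvalues, in decreasing order

gof_manual <- c(
  sum(ev[seq_len(r)]) / sum(abs(ev)),      # first element
  sum(ev[seq_len(r)]) / sum(pmax(ev, 0))   # second element
)

# Compare with the value reported by cmdscale
same <- isTRUE(all.equal(gof_manual, fit$GOF))
```

The two elements differ only when some eigenvalues are negative, which can happen with non-Euclidean distances.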

References

Delicado, P. and C. Pachón-García (2021). Multidimensional Scaling for Big Data. https://arxiv.org/abs/2007.11919.

Gower, J. C. and D. J. Hand (1995). Biplots, Volume 54. CRC Press.

Borg, I. and P. Groenen (2005). Modern Multidimensional Scaling: Theory and Applications. Springer.

Examples

set.seed(42)
x <- matrix(data = rnorm(4*10000), nrow = 10000) %*% diag(c(9, 4, 1, 1))
mds <- interpolation_mds(x = x, l = 200, r = 2, n_cores = 1, dist_fn = stats::dist)
head(mds$points)
mds$eigen
mds$GOF
points <- mds$points
plot(x[1:10, 1],
x[1:10, 2],
xlim = range(c(x[1:10,1],points[1:10,1])),
ylim = range(c(x[1:10,2], points[1:10,2])),
pch = 19,
col = "green")
text(x[1:10, 1], x[1:10, 2], labels=1:10)
points(points[1:10, 1], points[1:10, 2], pch = 19, col = "orange")
text(points[1:10, 1], points[1:10, 2], labels=1:10)
abline(v = 0, lwd=3, lty=2)
abline(h = 0, lwd=3, lty=2)



[Package bigmds version 2.0.1 Index]