R: Parallelized t-SNE

bdm.ptsne {bigMap}

R Documentation

Parallelized t-SNE

Description

Starts the ptSNE algorithm (first step of the mapping protocol).

Usage

bdm.ptsne(bdm, threads = 3, type = "SOCK", layers = 2, rounds = 1,
  boost = 2, whiten = 4, input.dim = NULL, ppx = 100, itr = 100,
  tol = 1e-05, alpha = 0.5, Y.init = NULL, info = 1)

Arguments

`bdm`	A `bdm` instance as generated by `bdm.init()`.
`threads`	The number of parallel threads (in principle limited only by hardware resources, i.e. the number of cores and available memory).
`type`	The type of cluster: 'SOCK' (default) for intra-node parallelization, 'MPI' (message passing interface) for inter-node parallelization.
`layers`	The number of layers (minimum 2, maximum the number of threads).
`rounds`	The number of rounds (1 by default).
`boost`	A running-time acceleration factor (2 by default). See Details.
`whiten`	Preprocessing of raw data. If `whiten = 4` (default value) raw data is transformed to principal components (PCA) and whitened afterwards. If `whiten = 3` only PCA is performed with NO whitening. If `whiten = 2` raw data is only centered and scaled. If `whiten = 1` raw data is only centered. If `whiten = 0` no preprocessing is performed at all.
`input.dim`	If raw data is given as (or is transformed to) principal components, `input.dim` sets the number of principal components to be used as input dimensions. Otherwise all data columns are used as input dimensions. By default (`NULL`), `input.dim = ncol(bdm$data)`.
`ppx`	The value of perplexity to compute similarities (100 by default).
`itr`	The number of iterations for computing input similarities (100 by default).
`tol`	The tolerance lower bound for computing input similarities (1e-05 by default).
`alpha`	The momentum factor (0.5 by default).
`Y.init`	An `n x 2` matrix with initial mapping positions. By default (`NULL`), random initial positions are used.
`info`	Progress output information: 1 yields inter-round results for progressive analytics, 0 disables intermediate results. Default value is 1.

Details

By default the algorithm is structured in $\sqrt{n}$ epochs of $\sqrt{z}$ iterations each, where $n$ is the dataset size and $z$ is the thread-size ($z = n \cdot layers / threads$). The running time of the algorithm is then $epochs \cdot iters \cdot t_i + epochs \cdot t_e$, where $t_i$ is the running time of a single iteration and $t_e$ is the inter-epoch running time.
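As a rough illustration of the default structure described above, the following base R sketch computes the thread-size and the resulting number of epochs and iterations per epoch. The function name `ptsne_schedule` is hypothetical and not part of bigMap, and rounding with `ceiling()` is an assumption; the package may round differently.

```r
# Sketch of the default epoch/iteration structure (illustration only, not bigMap code).
ptsne_schedule <- function(n, threads = 3, layers = 2) {
  stopifnot(layers >= 2, layers <= threads)
  z <- n * layers / threads        # thread-size
  epochs <- ceiling(sqrt(n))       # number of epochs (rounding is an assumption)
  iters <- ceiling(sqrt(z))        # iterations per epoch (rounding is an assumption)
  list(z = z, epochs = epochs, iters = iters, total.iters = epochs * iters)
}

s <- ptsne_schedule(n = 10000, threads = 4, layers = 2)
str(s)
```

With these inputs the thread-size is 5000, which gives 100 epochs of 71 iterations each under the assumed rounding.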

The boost factor is meant to reduce the running time. With boost > 1 the algorithm is structured in $\sqrt{n}/boost$ epochs of $\sqrt{z} \cdot boost$ iterations each. This structure performs the same total number of iterations, arranged into fewer epochs, and so decreases the total running time to $epochs \cdot iters \cdot t_i + (1/boost) \cdot epochs \cdot t_e$, where epochs and iters refer to the default structure. When the number of threads is high, the inter-epoch time can be high, in particular with 'MPI' parallelization, so reducing the number of epochs can significantly reduce the total running time. The drawback is that increasing the number of iterations per epoch might result in a lack of convergence, so the boost factor must be used with caution. To the best of our knowledge, values up to boost = 2.5 are generally safe.
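To get a feel for the trade-off, the running-time expression above can be evaluated directly. The sketch below is base R only; the function name `ptsne_runtime` and the timing values are hypothetical and chosen purely for illustration, not measured on any system.

```r
# Running-time model from the text: epochs * iters * t_i + (1 / boost) * epochs * t_e.
# `epochs` and `iters` refer to the default (boost = 1) structure.
ptsne_runtime <- function(epochs, iters, t_i, t_e, boost = 1) {
  stopifnot(boost >= 1)
  epochs * iters * t_i + (epochs / boost) * t_e
}

# Hypothetical timings in seconds (illustration only).
base    <- ptsne_runtime(epochs = 100, iters = 71, t_i = 0.01, t_e = 2, boost = 1)
boosted <- ptsne_runtime(epochs = 100, iters = 71, t_i = 0.01, t_e = 2, boost = 2)
c(base = base, boosted = boosted, saved = base - boosted)
```

Only the inter-epoch term shrinks with boost, so the gain is largest when `t_e` dominates, as is typical with many threads or 'MPI' clusters.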

For extremely large datasets, we strongly recommend initializing the bdm instance with already preprocessed data and using whiten = 0. Fast principal component approximations can be computed with, e.g., the flashpcaR or scater R packages.
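A minimal sketch of preprocessing outside bigMap is shown below, assuming that PCA followed by whitening is the desired transformation. It uses base R `prcomp()` on toy data as a stand-in for a faster approximation; the variable names are illustrative, and for very large data a dedicated package would replace the `prcomp()` call.

```r
# Sketch: PCA followed by whitening, done outside bigMap (base R only).
set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)  # toy data (illustration only)

k <- 3                                             # number of components to keep
pca <- prcomp(X, center = TRUE, scale. = FALSE, rank. = k)

# Divide each retained component by its standard deviation (whitening).
X.white <- sweep(pca$x, 2, pca$sdev[seq_len(k)], "/")

round(cov(X.white), 6)                             # approximately the identity matrix
# X.white would then be used to initialize the bdm instance,
# and bdm.ptsne() would be called with whiten = 0.
```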

Value

A copy of the input bdm instance with a new element `bdm$ptsne` (the t-SNE output).

Examples


# --- load example dataset
bdm.example()
# --- perform ptSNE
## Not run: 
exMap <- bdm.ptsne(exMap, threads = 10, layers = 2, rounds = 2, ppx = 200)

## End(Not run)
# --- plot the Cost function
bdm.cost(exMap)
# --- plot ptSNE output
bdm.ptsne.plot(exMap)

[Package bigMap version 2.3.1 Index]