bdm.ptsne {bigMap}    R Documentation

Parallelized t-SNE

Description

Starts the ptSNE algorithm (first step of the mapping protocol).

Usage

bdm.ptsne(bdm, threads = 3, type = "SOCK", layers = 2, rounds = 1,
  boost = 2, whiten = 4, input.dim = NULL, ppx = 100, itr = 100,
  tol = 1e-05, alpha = 0.5, Y.init = NULL, info = 1)

Arguments

bdm

A bdm instance as generated by bdm.init().

threads

The number of parallel threads (in principle limited only by hardware resources, i.e. the number of cores and the available memory).

type

The type of cluster: 'SOCK' (default) for intra-node parallelization, 'MPI' (message passing interface) for inter-node parallelization.

layers

The number of layers (minimum 2, maximum the number of threads).

rounds

The number of rounds (1 by default).

boost

A running-time acceleration factor (2 by default). See Details.

whiten

Preprocessing of raw data. If whiten = 4 (default value) raw data is transformed to principal components (PCA) and whitened afterwards. If whiten = 3 only PCA is performed with NO whitening. If whiten = 2 raw data is only centered and scaled. If whiten = 1 raw data is only centered. If whiten = 0 no preprocessing is performed at all.

input.dim

If raw data is given as (or is transformed to) principal components, input.dim sets the number of principal components to be used as input dimensions. Otherwise all data columns are used as input dimensions. By default (input.dim = NULL) all columns are used, i.e. input.dim = ncol(bdm$data).

ppx

The value of perplexity to compute similarities (100 by default).

itr

The number of iterations for computing input similarities (100 by default).

tol

The tolerance lower bound for computing input similarities (1e-05 by default).

alpha

The momentum factor (0.5 by default).

Y.init

An n x 2 matrix with initial mapping positions. By default (NULL) random initial positions are used.

info

Progress output information: 1 yields inter-round results for progressive analytics, 0 disables intermediate results. Default value is 1.

Details

By default the algorithm is structured in sqrt(n) epochs of sqrt(z) iterations each, where n is the dataset size and z is the thread-size (z = n*layers/threads). The running time of the algorithm is then determined by epochs*iters*t_i + epochs*t_e, where t_i is the running time of a single iteration and t_e is the inter-epoch running time.

The boost factor is meant to reduce the running time. With boost > 1 the algorithm is structured in sqrt(n)/boost epochs of sqrt(z)*boost iterations each. This structure performs the same total number of iterations but arranges them into fewer epochs, thus decreasing the total running time to epochs*iters*t_i + 1/boost*epochs*t_e, where epochs and iters are the default (boost = 1) values. When the number of threads is high, the inter-epoch time can be high, in particular when using 'MPI' parallelization, so reducing the number of epochs can significantly reduce the total running time. The trade-off is that increasing the number of iterations per epoch might result in a lack of convergence, so the boost factor must be used with caution. To the best of our knowledge, using values up to boost = 2.5 is generally safe.
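
The arithmetic above can be illustrated with a small sketch. The helpers ptsne_schedule() and ptsne_runtime() are hypothetical (they are not part of bigMap), the timings are arbitrary units, and the sketch assumes that boost divides the default sqrt(n) epochs and multiplies the default sqrt(z) iterations per epoch; the package may round these values differently.

```r
# Hypothetical helper (not part of bigMap): epoch/iteration schedule
# implied by the description above, assuming boost divides the default
# number of epochs and multiplies the default iterations per epoch.
ptsne_schedule <- function(n, threads, layers, boost = 1) {
  z <- n * layers / threads          # thread-size
  epochs <- sqrt(n) / boost          # number of epochs
  iters <- sqrt(z) * boost           # iterations per epoch
  list(z = z, epochs = epochs, iters = iters,
       total.iters = epochs * iters)
}

# Hypothetical helper: running-time model from the text, with t_i the
# time of one iteration and t_e the inter-epoch time.
ptsne_runtime <- function(sched, t_i, t_e) {
  sched$epochs * sched$iters * t_i + sched$epochs * t_e
}

base <- ptsne_schedule(n = 10000, threads = 8, layers = 2, boost = 1)
fast <- ptsne_schedule(n = 10000, threads = 8, layers = 2, boost = 2)

# Same total number of iterations, half as many epochs
c(base$total.iters, fast$total.iters)
c(base$epochs, fast$epochs)

# Only the inter-epoch term of the running time shrinks
rt.base <- ptsne_runtime(base, t_i = 0.01, t_e = 1)
rt.fast <- ptsne_runtime(fast, t_i = 0.01, t_e = 1)
c(rt.base, rt.fast)
```

In this toy setting the iteration term is unchanged and only the inter-epoch term is halved, which is why the gain grows with the cost of t_e (for instance under 'MPI' parallelization).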

For extremely large datasets, we strongly recommend initializing the bdm instance with already preprocessed data and using whiten = 0. Fast principal component approximations can be computed with, e.g., the flashpcaR or scater R packages.
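
As a rough illustration of preprocessing data outside bdm.ptsne(), the sketch below performs PCA followed by whitening with base R's prcomp() and keeps the leading components. The toy matrix, the variable names and the choice of 10 components are assumptions made for illustration only, and it is not confirmed that this matches the internal preprocessing of whiten = 4 exactly. For very large datasets an approximate PCA would take the place of prcomp().

```r
set.seed(1)
# Toy data standing in for a large raw dataset (assumption)
X <- matrix(rnorm(500 * 20), nrow = 500, ncol = 20)

# PCA on centered (unscaled) data
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Keep the first k principal components
k <- 10
scores <- pca$x[, seq_len(k), drop = FALSE]

# Whitening: rescale each component to unit variance
X.pp <- sweep(scores, 2, pca$sdev[seq_len(k)], "/")

dim(X.pp)                       # 500 10
round(apply(X.pp, 2, var), 6)   # every component has variance 1

# X.pp is the kind of matrix one would use to initialize the bdm
# instance, calling bdm.ptsne() afterwards with whiten = 0 so that
# no preprocessing is repeated.
```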

Value

A copy of the input bdm instance with new element bdm$ptsne (t-SNE output).

Examples


# --- load example dataset
bdm.example()
# --- perform ptSNE
## Not run: 
exMap <- bdm.ptsne(exMap, threads = 10, layers = 2, rounds = 2, ppx = 200)

## End(Not run)
# --- plot the Cost function
bdm.cost(exMap)
# --- plot ptSNE output
bdm.ptsne.plot(exMap)

[Package bigMap version 2.3.1 Index]