bdm.ptsne {bigMap}    R Documentation

Parallelized t-SNE


Starts the ptSNE algorithm (first step of the mapping protocol).


bdm.ptsne(bdm, threads = 3, type = "SOCK", layers = 2, rounds = 1,
  boost = 2, whiten = 4, input.dim = NULL, ppx = 100, itr = 100,
  tol = 1e-05, alpha = 0.5, Y.init = NULL, info = 1)



bdm: A bdm instance as generated by bdm.init().


threads: The number of parallel threads (in principle limited only by hardware resources, i.e. the number of cores and available memory).


type: The type of cluster: 'SOCK' (default) for intra-node parallelization, 'MPI' (message passing interface) for inter-node parallelization.


layers: The number of layers (minimum 2, maximum the number of threads).


rounds: The number of rounds (1 by default).


boost: A running-time acceleration factor (2 by default). See details.


whiten: Preprocessing of raw data. If whiten = 4 (default value), raw data is transformed to principal components (PCA) and whitened afterwards. If whiten = 3, only PCA is performed, with no whitening. If whiten = 2, raw data is only centered and scaled. If whiten = 1, raw data is only centered. If whiten = 0, no preprocessing is performed at all.


input.dim: If raw data is given as (or is transformed to) principal components, input.dim sets the number of principal components to be used as input dimensions. Otherwise all data columns are used as input dimensions. By default (NULL), input.dim = ncol(bdm$data).


ppx: The value of perplexity to compute similarities (100 by default).


itr: The number of iterations for computing input similarities (100 by default).


tol: The tolerance lower bound for computing input similarities (1e-05 by default).


alpha: The momentum factor (0.5 by default).


Y.init: An n x 2 matrix with initial mapping positions. By default (NULL), random initial positions are used.


info: Progress output information: 1 yields inter-round results for progressive analytics, 0 disables intermediate results. Default value is 1.
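

The preprocessing levels selected by whiten can be illustrated with base R alone. The following is a minimal sketch, not the package's internal code: preprocess_sketch is a hypothetical helper, and it assumes that PCA is computed on centered (unscaled) data and that whitening means dividing each principal component by its standard deviation.

```r
# Hypothetical helper, not part of bigMap: mimics the five whiten levels
# described above using base R only.
preprocess_sketch <- function(X, whiten = 4, input.dim = NULL) {
  X <- as.matrix(X)
  if (whiten == 0) return(X)                                       # no preprocessing
  if (whiten == 1) return(scale(X, center = TRUE, scale = FALSE))  # centered only
  if (whiten == 2) return(scale(X, center = TRUE, scale = TRUE))   # centered and scaled
  # whiten >= 3: PCA (assumption: computed on centered, unscaled data)
  pca <- stats::prcomp(X, center = TRUE, scale. = FALSE)
  # input.dim limits the number of principal components that are kept
  k <- if (is.null(input.dim)) ncol(pca$x) else min(input.dim, ncol(pca$x))
  Z <- pca$x[, seq_len(k), drop = FALSE]
  # whiten == 4: rescale every kept component to unit variance
  if (whiten == 4) Z <- sweep(Z, 2, pca$sdev[seq_len(k)], "/")
  Z
}

set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200)   # made-up data for illustration
Z <- preprocess_sketch(X, whiten = 4, input.dim = 3)
dim(Z)             # 200 rows, 3 whitened components
round(cov(Z), 6)   # approximately the identity matrix
```

With whiten = 3 the same components are returned without the final rescaling, so their variances remain unequal.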


Without boosting (boost = 1) the algorithm is structured in sqrt(n) epochs of sqrt(z) iterations each, where n is the dataset size and z is the thread-size (z = n * layers / threads). The running time of the algorithm is then epochs * iters * t_i + epochs * t_e, where t_i is the running time of a single iteration and t_e is the inter-epoch running time.

The boost factor is meant to reduce the running time. With boost > 1 the algorithm is structured in sqrt(n)/boost epochs of sqrt(z)*boost iterations each. This structure performs the same total number of iterations, arranged into fewer epochs, and so decreases the total running time to epochs * iters * t_i + (1/boost) * epochs * t_e, where epochs and iters are the unboosted values. When the number of threads is high, the inter-epoch time can be high, in particular with 'MPI' parallelization, so reducing the number of epochs can significantly reduce the total running time. The drawback is that increasing the number of iterations per epoch might prevent convergence, so the boost factor must be used with caution. To the best of our knowledge, values up to boost = 2.5 are generally safe.
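
The effect of boost on the schedule can be checked numerically. The sketch below is illustrative only: schedule_sketch and runtime_sketch are hypothetical helpers, they assume that boost divides the sqrt(n) epochs and multiplies the sqrt(z) iterations per epoch, and they keep real-valued counts because any rounding applied by the package is not documented here. The timings t_i and t_e are made-up values.

```r
# Hypothetical helpers, not part of bigMap.
schedule_sketch <- function(n, threads, layers, boost = 1) {
  z <- n * layers / threads     # thread-size
  epochs <- sqrt(n) / boost     # fewer epochs when boost > 1
  iters <- sqrt(z) * boost      # more iterations per epoch when boost > 1
  list(z = z, epochs = epochs, iters = iters, total = epochs * iters)
}

# Running time: iteration cost plus inter-epoch cost.
runtime_sketch <- function(s, t_i, t_e) {
  s$epochs * s$iters * t_i + s$epochs * t_e
}

base <- schedule_sketch(n = 10000, threads = 10, layers = 2, boost = 1)
fast <- schedule_sketch(n = 10000, threads = 10, layers = 2, boost = 2)

# Same total number of iterations in both schedules
all.equal(base$total, fast$total)

# Only the inter-epoch term shrinks (t_i and t_e are made-up timings, in seconds)
saved <- runtime_sketch(base, t_i = 0.01, t_e = 1) -
  runtime_sketch(fast, t_i = 0.01, t_e = 1)
saved
```

In this example the unboosted schedule has 100 epochs and the boosted one has 50, so the saving equals 50 inter-epoch intervals.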

For extremely large datasets, we strongly recommend initializing the bdm instance with already preprocessed data and using whiten = 0. Fast principal component approximations can be computed with, e.g., the flashpcaR or scater R packages.
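
As a rough illustration of preprocessing outside the package, the sketch below computes a truncated, whitened PCA with base R's prcomp(); for truly large data a fast approximation such as those mentioned above would take its place. The data, the number of components k and the object names are made up for the example, and the bigMap calls are left as comments because they depend on how the bdm instance is built.

```r
set.seed(42)
X <- matrix(rnorm(500 * 20), nrow = 500)   # stand-in for a large raw dataset
k <- 5                                      # illustrative number of components

# Truncated PCA on centered data; rank. limits the components returned in pca$x
pca <- stats::prcomp(X, center = TRUE, scale. = FALSE, rank. = k)

# Whitening: rescale each kept component to unit variance
Z <- sweep(pca$x, 2, pca$sdev[seq_len(k)], "/")
dim(Z)   # 500 rows, 5 components

# Z would then be used to build the bdm instance with bdm.init(), and the
# mapping started without further preprocessing, e.g.
# m <- bdm.ptsne(m, whiten = 0)   # m: a bdm instance built from Z
```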


A copy of the input bdm instance with new element bdm$ptsne (t-SNE output).


# --- load example dataset
# --- perform ptSNE
## Not run: 
exMap <- bdm.ptsne(exMap, threads = 10, layers = 2, rounds = 2, ppx = 200)

## End(Not run)
# --- plot the Cost function
# --- plot ptSNE output

[Package bigMap version 2.3.1 Index]