bdm.ptsne {bigMap} | R Documentation |
Parallelized t-SNE
Description
Starts the ptSNE algorithm (first step of the mapping protocol).
Usage
bdm.ptsne(bdm, threads = 3, type = "SOCK", layers = 2, rounds = 1,
boost = 2, whiten = 4, input.dim = NULL, ppx = 100, itr = 100,
tol = 1e-05, alpha = 0.5, Y.init = NULL, info = 1)
Arguments
bdm |
A bdm instance as generated by |
threads |
The number of parallel threads (in principle only limited by hardware resources, |
type |
The type of cluster: 'SOCK' (default) for intra-node parallelization, 'MPI' ( |
layers |
The number of layers ( |
rounds |
The number of rounds (1 by default). |
boost |
A running time accelerator factor. By default ( |
whiten |
Preprocessing of raw data. If |
input.dim |
If raw data is given as (or is transformed to) principal components, input.dim sets the number of principal components to be used as input dimensions. Otherwise all data columns are used as input dimensions. By default |
ppx |
The value of perplexity to compute similarities (100 by default). |
itr |
The number of iterations for computing input similarities (100 by default). |
tol |
The tolerance lower bound for computing input similarities (1e-05 by default). |
alpha |
The momentum factor (0.5 by default). |
Y.init |
A |
info |
Progress output information: 1 yields inter-round results for progressive analytics, 0 disables intermediate results. Default value is 1. |
Details
By default the algorithm is structured in \sqrt{n} epochs of \sqrt{z} iterations each, where n is the dataset size and z is the thread-size (z = n * layers / threads). The running time of the algorithm is then determined by epochs * iters * t_i + epochs * t_e, where t_i is the running time of a single iteration and t_e is the inter-epoch running time.
The boost factor is meant to reduce the running time. With boost > 1 the algorithm is structured in \sqrt{n}/boost epochs of \sqrt{z} * boost iterations each. This structure performs the same total number of iterations but arranges them into fewer epochs, decreasing the total running time to epochs * iters * t_i + (1/boost) * epochs * t_e, where epochs and iters are the default (unboosted) values. When the number of threads is high, the inter-epoch time can be high, particularly with 'MPI' parallelization, so reducing the number of epochs can significantly reduce the total running time. The trade-off is that increasing the number of iterations per epoch might result in a lack of convergence, so the boost factor must be used with caution. To the best of our knowledge, values up to boost = 2.5 are generally safe.
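The schedule described above can be sketched in a few lines of R. This is only an illustration of the arithmetic: ptsne_schedule and ptsne_runtime are hypothetical helpers, not part of bigMap, rounding with ceiling() is an assumption, and the timings t_i and t_e passed at the end are made-up values.

```r
# Hypothetical helper (not part of bigMap): epoch/iteration schedule implied
# by the description above. Rounding with ceiling() is an assumption.
ptsne_schedule <- function(n, threads = 3, layers = 2, boost = 1) {
  z <- n * layers / threads            # thread-size
  epochs <- ceiling(sqrt(n) / boost)   # fewer epochs as boost grows
  iters <- ceiling(sqrt(z) * boost)    # more iterations per epoch
  list(z = z, epochs = epochs, iters = iters)
}

# Running-time model from the text: epochs * iters * t_i + epochs * t_e
ptsne_runtime <- function(sched, t_i, t_e) {
  sched$epochs * sched$iters * t_i + sched$epochs * t_e
}

base <- ptsne_schedule(n = 10000, threads = 10, layers = 2, boost = 1)
fast <- ptsne_schedule(n = 10000, threads = 10, layers = 2, boost = 2)

# Same total number of iterations, half as many epochs.
c(base$epochs * base$iters, fast$epochs * fast$iters)

# Illustrative timings only (seconds): t_i = 0.01, t_e = 1.
c(ptsne_runtime(base, 0.01, 1), ptsne_runtime(fast, 0.01, 1))
```

With these inputs the iteration term is the same for both schedules, so the whole difference in modelled running time comes from the inter-epoch term.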
For extremely large datasets, we strongly recommend initializing the bdm instance with already preprocessed data and using whiten = 0. Fast principal component approximations can be computed with, e.g., the flashpcaR or scater R packages.
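As a minimal sketch of the preprocessing step, the following reduces a data matrix to its leading principal components before any bdm instance is created. It uses exact PCA from base R (stats::prcomp) as a stand-in for the fast approximations mentioned above. The toy matrix X and the choice k = 4 are assumptions for illustration only.

```r
# Toy data standing in for a large raw data matrix (rows = observations).
set.seed(1)
X <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)

# Keep the first k principal components (k is an illustrative choice).
k <- 4
pca <- prcomp(X, center = TRUE, scale. = FALSE, rank. = k)
X.pca <- pca$x   # matrix of principal component scores, one column per PC

dim(X.pca)

# X.pca would then be used as the input data for the bdm instance, and
# bdm.ptsne() would be called with whiten = 0 as recommended above.
```

For datasets too large for prcomp, the same step would be done with one of the approximate methods named above. Their interfaces are not shown here.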
Value
A copy of the input bdm instance with new element bdm$ptsne (t-SNE output).
Examples
# --- load example dataset
bdm.example()
# --- perform ptSNE
## Not run:
exMap <- bdm.ptsne(exMap, threads = 10, layers = 2, rounds = 2, ppx = 200)
## End(Not run)
# --- plot the Cost function
bdm.cost(exMap)
# --- plot ptSNE output
bdm.ptsne.plot(exMap)