TADPole {dtwclust} | R Documentation |
TADPole clustering
Description
Time-series Anytime Density Peaks Clustering as proposed by Begum et al. (2015).
Usage
TADPole(
data,
k = 2L,
dc,
window.size,
error.check = TRUE,
lb = "lbk",
trace = FALSE
)
tadpole(
data,
k = 2L,
dc,
window.size,
error.check = TRUE,
lb = "lbk",
trace = FALSE
)
Arguments
data |
A matrix or data frame where each row is a time series, or a list where each element is a time series. Multivariate series are not supported. |
k |
The number of desired clusters. Can be a vector with several values. |
dc |
The cutoff distance(s). Can be a vector with several values. |
window.size |
Window size constraint for DTW (Sakoe-Chiba). See details. |
error.check |
Logical indicating whether the function should try to detect inconsistencies and give more informative errors messages. Also used internally to avoid repeating checks. |
lb |
Which lower bound to use, "lbk" for |
trace |
Logical flag. If |
Details
This function can be called either directly or through tsclust()
.
TADPole clustering adopts a relatively new clustering framework and adapts it to time series clustering with DTW. See the cited article for the details of the algorithm.
Because of the way the algorithm works, it can be considered a kind of Partitioning Around
Medoids (PAM). This means that the cluster centroids are always elements of the data. However,
this algorithm is deterministic, depending on the value of dc
.
The algorithm first uses the DTW's upper and lower bounds (Euclidean and LB_Keogh respectively)
to find series with many close neighbors (in DTW space). Anything below the cutoff distance
(dc
) is considered a neighbor. Aided with this information, the algorithm then tries to prune
as many DTW calculations as possible in order to accelerate the clustering procedure. The series
that lie in dense areas (i.e. that have lots of neighbors) are taken as cluster centroids.
The algorithm relies on the DTW bounds, which are only defined for univariate time series of equal length.
Parallelization is supported in the following way:
For multiple
dc
values, multi-processing withforeach::foreach()
.The internal distance calculations use multi-threading with RcppParallel::RcppParallel.
The windowing constraint uses a centered window.
The calculations expect a value in window.size
that represents the distance between the point considered and one of the edges of the window.
Therefore, if, for example, window.size = 10
, the warping for an observation x_i
considers the points between x_{i-10}
and x_{i+10}
,
resulting in 10(2) + 1 = 21
observations falling within the window.
Value
A list with:
-
cl
: Cluster indices. -
centroids
: Indices of the centroids. -
distCalcPercentage
: Percentage of distance calculations that were actually performed.
For multiple k
/dc
values, a list of lists is returned, each internal list having the
aforementioned elements.
Parallel Computing
Please note that running tasks in parallel does not guarantee faster computations. The overhead introduced is sometimes too large, and it's better to run tasks sequentially.
This function uses the RcppParallel package for parallelization.
It uses all available threads by default (see RcppParallel::defaultNumThreads()
),
but this can be changed by the user with RcppParallel::setThreadOptions()
.
An exception to the above is when it is called within a foreach
parallel loop made by dtwclust.
If the parallel workers do not have the number of threads explicitly specified,
this function will default to 1 thread per worker.
See the parallelization vignette for more information - browseVignettes("dtwclust")
References
Begum N, Ulanova L, Wang J and Keogh E (2015). “Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy.” In Conference on Knowledge Discovery and Data Mining, series KDD '15. ISBN 978-1-4503-3664-2/15/08, doi:10.1145/2783258.2783286.