tsclust {dtwclust} | R Documentation |
Time series clustering
Description
This is the main function to perform time series clustering. See the details and the examples for
more information, as well as the included package vignettes (which can be found by typing
browseVignettes("dtwclust")). A convenience wrapper is available in compare_clusterings(),
and a shiny app in interactive_clustering().
Usage
tsclust(
series = NULL,
type = "partitional",
k = 2L,
...,
preproc = NULL,
distance = "dtw_basic",
centroid = ifelse(type == "fuzzy", "fcm", "pam"),
control = do.call(paste0(type, "_control"), list()),
args = tsclust_args(),
seed = NULL,
trace = FALSE,
error.check = TRUE
)
Arguments
series |
A list of series, a numeric matrix or a data frame. Matrices and data frames are
coerced to a list row-wise (see |
type |
What type of clustering method to use: |
k |
Number of desired clusters. It can be a numeric vector with different values. |
... |
Arguments to pass to preprocessing, centroid and distance functions (added to
|
preproc |
Function to preprocess data. Defaults to |
distance |
A registered distance from |
centroid |
Either a supported string, or an appropriate function to calculate centroids when using partitional/hierarchical/tadpole methods. See Centroids section. |
control |
An appropriate list of controls. See tsclust-controls. |
args |
An appropriate list of arguments for preprocessing, distance and centroid functions.
See |
seed |
Random seed for reproducibility. |
trace |
Logical flag. If |
error.check |
Logical indicating whether the function should try to detect inconsistencies and give more informative error messages. Also used internally to avoid repeating checks. |
Details
Partitional and fuzzy clustering procedures use a custom implementation. Hierarchical clustering
is done with stats::hclust() by default. TADPole clustering uses the TADPole() function.
Specifying type = "partitional", preproc = zscore, distance = "sbd" and centroid = "shape"
is equivalent to the k-Shape algorithm (Paparrizos and Gravano 2015).
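The k-Shape configuration described above might be written as follows. This is a minimal sketch that assumes the CharTraj data shipped with the package (loaded with data(uciCT), as in the Examples section); the subset size, k and seed are arbitrary values chosen for illustration.

```r
library(dtwclust)
data(uciCT)

# Small subset for speed (arbitrary choice)
series <- CharTraj[1:20]

# Partitional clustering with z-normalization, shape-based distance and
# shape extraction for the centroids, i.e. the k-Shape setup named above
ks <- tsclust(series, type = "partitional", k = 4L,
              preproc = zscore, distance = "sbd", centroid = "shape",
              seed = 100L)

# Cluster membership of each series
ks@cluster
```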
The series may be provided as a matrix, a data frame or a list. Matrices and data frames are
coerced to a list, both row-wise. Only lists can have series with different lengths or multiple
dimensions. Most of the optimizations require series to have the same length, so consider
reinterpolating them to save some time (see Ratanamahatana and Keogh 2004; reinterpolate()). No
missing values are allowed.
In the case of multivariate time series, they should be provided as a list of matrices, where
time spans the rows of each matrix and the variables span the columns (see CharTrajMV for an
example). All included centroid functions should work with the aforementioned format, although
shape is not recommended. Note that the plot method will simply append all dimensions
(columns) one after the other.
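As an illustration of the multivariate format, the following sketch builds a small list of matrices from simulated random walks. The data, the variable names x and y, and the lengths are all made up for the example; only the layout (time along rows, variables along columns) comes from the description above.

```r
set.seed(42)

# Three bivariate series of different lengths (hypothetical data)
mv_series <- lapply(c(10L, 12L, 15L), function(n) {
  # Rows are time points, columns are variables
  cbind(x = cumsum(rnorm(n)), y = cumsum(rnorm(n)))
})

# Each element is a matrix with one column per variable
vapply(mv_series, ncol, integer(1L))
vapply(mv_series, nrow, integer(1L))
```

Because the series have different lengths, they can only be passed as a list, as noted above.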
Value
An object with an appropriate class from TSClusters.
If control$nrep > 1 and a partitional procedure is used, length(method) > 1 and
hierarchical procedures are used, or length(k) > 1, a list of objects is returned.
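A short sketch of the list return value, assuming the CharTraj data from the package; the range of k, the seed and the choice of the silhouette index are arbitrary.

```r
library(dtwclust)
data(uciCT)

# Equal lengths are needed for the Euclidean (L2) distance
series <- reinterpolate(CharTraj[1:20], new.length = max(lengths(CharTraj)))

# Several values of k in one call: one clustering object per value
pcs <- tsclust(series, k = 2L:4L, distance = "L2", centroid = "pam", seed = 94L)

is.list(pcs)
length(pcs)

# Each element can be evaluated separately, e.g. with a validity index
sapply(pcs, cvi, type = "Sil")
```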
Centroid Calculation
In the case of partitional/fuzzy algorithms, a suitable function should calculate the cluster centroids at every iteration. In this case, the centroids may also be time series. Fuzzy clustering uses the standard fuzzy c-means centroid by default.
In either case, a custom function can be provided. If one is provided, it will receive the following parameters with the shown names (examples for partitional clustering are shown in parentheses):
- x: The whole data list (list(ts1, ts2, ts3))
- cl_id: An integer vector with length equal to the number of series in data, indicating which cluster a series belongs to (c(1L, 2L, 2L))
- k: The desired number of total clusters (2L)
- cent: The current centroids in order, in a list (list(centroid1, centroid2))
- cl_old: The membership vector of the previous iteration (c(1L, 1L, 2L))
- The elements of ... that match its formal arguments
In case of fuzzy clustering, the membership vectors (2nd and 5th elements above) are matrices with number of rows equal to the number of elements in the data, and number of columns equal to the number of desired clusters. Each row must sum to 1.
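A custom centroid function for the partitional case could look roughly like the sketch below. The name mean_centroid is hypothetical, and two things are assumptions rather than documented behaviour: that all series are univariate with equal length, and that the function should return a list of k centroids in cluster order (mirroring the format of cent).

```r
# Hypothetical custom centroid: pointwise mean of each cluster's members.
# Assumes univariate, equal-length series and hard memberships in cl_id.
mean_centroid <- function(x, cl_id, k, cent, cl_old, ...) {
  lapply(seq_len(k), function(j) {
    members <- x[cl_id == j]
    # Keep the previous centroid if a cluster ended up empty
    if (length(members) == 0L) return(cent[[j]])
    # Stack members as rows and average each time point
    colMeans(do.call(rbind, members))
  })
}

# Stand-alone check with toy data matching the examples in parentheses above
x <- list(c(1, 2, 3), c(3, 4, 5), c(5, 6, 7))
new_cent <- mean_centroid(x, cl_id = c(1L, 2L, 2L), k = 2L,
                          cent = x[1:2], cl_old = c(1L, 1L, 2L))
new_cent
```

Such a function would then be passed through the centroid argument, with any extra parameters supplied in args$cent.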
The other option is to provide a character string for the custom implementations. The following options are available:
"mean": The average along each dimension. In other words, the average of all
x^j_i among the j series that belong to the same cluster for all time points t_i.
"median": The median along each dimension. Similar to mean.
"shape": Shape averaging. By default, all series are z-normalized in this case, since the resulting centroids will also have this normalization. See shape_extraction() for more details.
"dba": DTW Barycenter Averaging. See DBA() for more details.
"sdtw_cent": Soft-DTW centroids. See sdtw_cent() for more details.
for more details."pam": Partition around medoids (PAM). This basically means that the cluster centroids are always one of the time series in the data. In this case, the distance matrix can be pre-computed once using all time series in the data and then re-used at each iteration. It usually saves overhead overall for small datasets (see tsclust-controls).
"fcm": Fuzzy c-means. Only supported for fuzzy clustering and used by default in that case.
"fcmdd": Fuzzy c-medoids. Only supported for fuzzy clustering. It always precomputes/uses the whole cross-distance matrix.
The dba, shape and sdtw_cent implementations check for parallelization. Note that only
shape, dba, sdtw_cent, pam and fcmdd support series of different length. Also note
that for shape, dba and sdtw_cent, this support has a caveat: the final centroids' length
will depend on the length of those series that were randomly chosen at the beginning of the
clustering algorithm. For example, if the series in the dataset have a length of either 10 or
15, 2 clusters are desired, and the initial choice selects two series with length of 10, the
final centroids will have this same length.
As special cases, if hierarchical or tadpole clustering is used, you can provide a centroid
function that takes a list of series as first input. It will also receive the contents of
args$cent that match its formal arguments, and should return a single centroid series. These
centroids are returned in the centroids slot. By default, the medoid of each cluster is
extracted (similar to what pam_cent() does).
In the following cases, the centroids list will have an attribute series_id with an integer
vector indicating which series were chosen as centroids:
Partitional clustering using "pam" centroid.
Fuzzy clustering using "fcmdd" centroid.
Hierarchical clustering with the default centroid extraction.
TADPole clustering with the default centroid extraction.
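The series_id attribute can be read with attr() from the centroids slot. The sketch below assumes the CharTraj data from the package and uses PAM centroids; k and the seed are arbitrary.

```r
library(dtwclust)
data(uciCT)

series <- reinterpolate(CharTraj[1:20], new.length = max(lengths(CharTraj)))

pc <- tsclust(series, k = 4L, distance = "L2", centroid = "pam", seed = 3247L)

# Indices (into the input list) of the series chosen as medoids
ids <- attr(pc@centroids, "series_id")
ids

# Names of the corresponding series
names(series)[ids]
```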
Distance Measures
The distance measure to be used with partitional, hierarchical and fuzzy clustering can be
modified with the distance parameter. The supported option is to provide a string, which must
represent a compatible distance registered with proxy's proxy::dist(). Registration is done
via proxy::pr_DB(), and extra parameters can be provided in args$dist (see the examples).
Note that you are free to create your own distance functions and register them. Optionally, you
can use one of the following custom implementations (all registered with proxy):
- "dtw": DTW, optionally with a Sakoe-Chiba/Slanted-band constraint. Done with dtw::dtw().
- "dtw2": DTW with L2 norm and optionally a Sakoe-Chiba/Slanted-band constraint. See dtw2().
- "dtw_basic": A custom version of DTW with less functionality, but faster. See dtw_basic().
- "dtw_lb": DTW with L1 or L2 norm and a Sakoe-Chiba constraint. Some computations are avoided by first estimating the distance matrix with Lemire's lower bound and then iteratively refining with DTW. See dtw_lb(). Not suitable for pam.precompute = TRUE nor hierarchical clustering.
- "lbk": Keogh's lower bound for DTW with either L1 or L2 norm for the Sakoe-Chiba constraint. See lb_keogh().
- "lbi": Lemire's lower bound for DTW with either L1 or L2 norm for the Sakoe-Chiba constraint. See lb_improved().
- "sbd": Shape-based distance. See sbd().
- "gak": Global alignment kernels. See gak().
- "sdtw": Soft-DTW. See sdtw().
Out of the aforementioned, only the distances based on DTW lower bounds don't support series of different length. The lower bounds are probably unsuitable for direct clustering unless series are very easily distinguishable.
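Since the distances listed above are registered with proxy, they can presumably also be called directly through proxy::dist() once dtwclust is loaded. The sketch below does this for "dtw_basic" on a few CharTraj series; the window size is an arbitrary value, passed on to the distance function in the same way as the contents of args$dist.

```r
library(dtwclust)  # loading the package registers its distances with proxy
data(uciCT)

# Cross-distance matrix among the first five series with a constrained DTW
d <- proxy::dist(CharTraj[1:5], method = "dtw_basic", window.size = 18L)

dim(d)
```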
If you know that the distance function is symmetric, and you use a hierarchical algorithm, or a
partitional algorithm with PAM centroids, or fuzzy c-medoids, some time can be saved by
calculating only half the distance matrix. Therefore, consider setting the symmetric control
parameter to TRUE if this is the case (see tsclust-controls).
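For instance, the Euclidean distance is symmetric, so the control parameter could be set as in this sketch. It assumes the CharTraj data from the package, reinterpolated to a common length; the linkage method and k are arbitrary.

```r
library(dtwclust)
data(uciCT)

series <- reinterpolate(CharTraj[1:20], new.length = max(lengths(CharTraj)))

# L2 is symmetric, so only half of the distance matrix has to be computed
hc <- tsclust(series, type = "hierarchical", k = 4L,
              distance = "L2",
              control = hierarchical_control(method = "average",
                                             symmetric = TRUE))

table(hc@cluster)
```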
Preprocessing
It is strongly advised to use z-normalization in case of centroid = "shape", because the
resulting series have this normalization (see shape_extraction()). Therefore, zscore() is
the default in this case. The user can, however, specify a custom function that performs any
transformation on the data, but the user must make sure that the format stays consistent, i.e.
a list of time series.
Setting preproc to NULL means no preprocessing (except for centroid = "shape"). A provided
function will receive the data as first argument, followed by the contents of args$preproc
that match its formal arguments.
It is convenient to provide this function if you're planning on using the stats::predict()
generic (see also TSClusters-methods).
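A custom preprocessing function might look like the following sketch. The name center_series and its scale parameter are hypothetical; the only requirements taken from the text are that the data comes in as the first argument and that a list of time series is returned. Univariate series are assumed.

```r
# Hypothetical preprocessing function: center each series, optionally scale it.
# The first argument is the data (a list of univariate series); the result
# keeps the same list format.
center_series <- function(x, scale = FALSE, ...) {
  lapply(x, function(s) {
    s <- s - mean(s)
    if (scale) s <- s / stats::sd(s)
    s
  })
}

# Stand-alone check on toy data
out <- center_series(list(c(1, 2, 3), c(10, 20)))
out

# It could then be supplied to tsclust(), with extra parameters in args$preproc:
# tsclust(series, k = 4L, preproc = center_series,
#         args = tsclust_args(preproc = list(scale = TRUE)))
```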
Repetitions
Due to its stochastic nature, partitional clustering is usually repeated several times with
different random seeds to allow for different starting points. This function uses
parallel::nextRNGStream() to obtain different seed streams for each repetition, utilizing the
seed parameter (if provided) to initialize it. If more than one repetition is made, the
streams are returned in an attribute called rng.
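The stream mechanism itself can be illustrated with base R alone. This sketch only shows how parallel::nextRNGStream() derives successive streams from an initial seed; it is not a reproduction of what tsclust() does internally, and the seed value is arbitrary.

```r
# nextRNGStream() requires the L'Ecuyer-CMRG generator
old_kind <- RNGkind("L'Ecuyer-CMRG")
set.seed(3247)

# Initial stream, followed by one new stream per (hypothetical) repetition
stream1 <- .Random.seed
stream2 <- parallel::nextRNGStream(stream1)
stream3 <- parallel::nextRNGStream(stream2)

# Restore the previous generator kind
RNGkind(old_kind[1L])

# The streams differ from each other
identical(stream1, stream2)
identical(stream2, stream3)
```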
Multiple values of k can also be provided to get different partitions using any type of
clustering.
Repetitions are greatly optimized when PAM centroids are used and the whole distance matrix is precomputed, since said matrix is reused for every repetition.
Parallel Computing
Please note that running tasks in parallel does not guarantee faster computations. The overhead introduced is sometimes too large, and it's better to run tasks sequentially.
The user can register a parallel backend, e.g. with the doParallel package, in order to
attempt to speed up the calculations (see the examples). This relies on foreach::foreach(),
i.e. it uses multi-processing.
Multi-processing is used in partitional and fuzzy clustering for multiple values of k and/or
nrep (in partitional_control()). See TADPole() to know how it uses parallelization. For
cross-distance matrix calculations, the parallelization strategy depends on whether the
distance is included with dtwclust or not, see the caveats in tsclustFamily.
If you register a parallel backend and special packages must be loaded, provide their names in
the packages element of control. Note that "dtwclust" is always loaded in each parallel
worker, so that doesn't need to be included. Alternatively, you may want to pre-load
dtwclust in each worker with parallel::clusterEvalQ().
Note
The lower bounds are defined only for time series of equal length. They are not symmetric,
and DTW is not symmetric in general.
Author(s)
Alexis Sarda-Espinosa
References
Please refer to the package vignette references (which can be loaded by typing
vignette("dtwclust")).
See Also
TSClusters, TSClusters-methods, tsclustFamily, tsclust-controls,
compare_clusterings(), interactive_clustering(), ssdtwclust().
Examples
#' NOTE: More examples are available in the vignette. Here are just some miscellaneous
#' examples that might come in handy. They should all work, but some don't run
#' automatically.
# Load data
data(uciCT)
# ====================================================================================
# Simple partitional clustering with Euclidean distance and PAM centroids
# ====================================================================================
# Reinterpolate to same length
series <- reinterpolate(CharTraj, new.length = max(lengths(CharTraj)))
# Subset for speed
series <- series[1:20]
labels <- CharTrajLabels[1:20]
# Making many repetitions
pc.l2 <- tsclust(series, k = 4L,
                 distance = "L2", centroid = "pam",
                 seed = 3247, trace = TRUE,
                 control = partitional_control(nrep = 10L))
# Cluster validity indices
sapply(pc.l2, cvi, b = labels)
# ====================================================================================
# Hierarchical clustering with Euclidean distance
# ====================================================================================
# Re-use the distance matrix from the previous example (all matrices are the same)
# Use all available linkage methods for function hclust
hc.l2 <- tsclust(series, type = "hierarchical",
                 k = 4L, trace = TRUE,
                 control = hierarchical_control(method = "all",
                                                distmat = pc.l2[[1L]]@distmat))
# Plot the best dendrogram according to variation of information
plot(hc.l2[[which.min(sapply(hc.l2, cvi, b = labels, type = "VI"))]])
# ====================================================================================
# Multivariate time series
# ====================================================================================
# Multivariate series, provided as a list of matrices
mv <- CharTrajMV[1L:20L]
# Using GAK distance
mvc <- tsclust(mv, k = 4L, distance = "gak", seed = 390,
               args = tsclust_args(dist = list(sigma = 100)))
# Note how the variables of each series are appended one after the other in the plot
plot(mvc)
## Not run:
# ====================================================================================
# This function is more verbose but allows for more explicit fine-grained control
# ====================================================================================
tsc <- tsclust(series, k = 4L,
               distance = "gak", centroid = "dba",
               preproc = zscore, seed = 382L, trace = TRUE,
               control = partitional_control(iter.max = 30L),
               args = tsclust_args(preproc = list(center = FALSE),
                                   dist = list(window.size = 20L,
                                               sigma = 100),
                                   cent = list(window.size = 15L,
                                               norm = "L2",
                                               trace = TRUE)))
# ====================================================================================
# Registering a custom distance with the 'proxy' package and using it
# ====================================================================================
# Normalized asymmetric DTW distance
ndtw <- function(x, y, ...) {
    dtw::dtw(x, y, step.pattern = dtw::asymmetric,
             distance.only = TRUE, ...)$normalizedDistance
}
# Registering the function with 'proxy' (only if it is not registered yet)
if (!proxy::pr_DB$entry_exists("nDTW"))
    proxy::pr_DB$set_entry(FUN = ndtw, names = c("nDTW"),
                           loop = TRUE, type = "metric", distance = TRUE,
                           description = "Normalized asymmetric DTW")
# Subset of (original) data for speed
pc.ndtw <- tsclust(series[-1L], k = 4L,
                   distance = "nDTW",
                   seed = 8319,
                   trace = TRUE,
                   args = tsclust_args(dist = list(window.size = 18L)))
# Which cluster would the first series belong to?
# Notice that newdata is provided as a list
predict(pc.ndtw, newdata = series[1L])
# ====================================================================================
# Custom hierarchical clustering
# ====================================================================================
require(cluster)
hc.diana <- tsclust(series, type = "h", k = 4L,
                    distance = "L2", trace = TRUE,
                    control = hierarchical_control(method = diana))
plot(hc.diana, type = "sc")
# ====================================================================================
# TADPole clustering
# ====================================================================================
pc.tadp <- tsclust(series, type = "tadpole", k = 4L,
                   control = tadpole_control(dc = 1.5,
                                             window.size = 18L))
# Modify plot, show only clusters 3 and 4
plot(pc.tadp, clus = 3:4,
     labs.arg = list(title = "TADPole, clusters 3 and 4",
                     x = "time", y = "series"))
# Saving and modifying the ggplot object with custom time labels
require(scales)
t <- seq(Sys.Date(), len = length(series[[1L]]), by = "day")
gpc <- plot(pc.tadp, time = t, plot = FALSE)
gpc + ggplot2::scale_x_date(labels = date_format("%b-%Y"),
                            breaks = date_breaks("2 months"))
# ====================================================================================
# Specifying a centroid function for prototype extraction in hierarchical clustering
# ====================================================================================
# Seed is due to possible randomness in shape_extraction when selecting a basis series
hc.sbd <- tsclust(CharTraj, type = "hierarchical",
                  k = 20L, distance = "sbd",
                  preproc = zscore, centroid = shape_extraction,
                  seed = 320L)
plot(hc.sbd, type = "sc")
# ====================================================================================
# Using parallel computation to optimize several random repetitions
# and distance matrix calculation
# ====================================================================================
require(doParallel)
# Create parallel workers
cl <- makeCluster(detectCores())
invisible(clusterEvalQ(cl, library(dtwclust)))
registerDoParallel(cl)
## Use constrained DTW and PAM
pc.dtw <- tsclust(CharTraj, k = 20L, seed = 3251, trace = TRUE,
                  args = tsclust_args(dist = list(window.size = 18L)))
## Use constrained DTW with DBA centroids
pc.dba <- tsclust(CharTraj, k = 20L, centroid = "dba",
                  seed = 3251, trace = TRUE,
                  args = tsclust_args(dist = list(window.size = 18L),
                                      cent = list(window.size = 18L)))
## Use distance based on global alignment kernels
pc.gak <- tsclust(CharTraj, k = 20L,
                  distance = "gak",
                  centroid = "dba",
                  seed = 8319,
                  trace = TRUE,
                  control = partitional_control(nrep = 8L),
                  args = tsclust_args(dist = list(window.size = 18L),
                                      cent = list(window.size = 18L)))
# Stop parallel workers
stopCluster(cl)
# Return to sequential computations. This MUST be done if stopCluster() was called
registerDoSEQ()
## End(Not run)