R: Downhill Riding (DR) Procedure

DR {funtimes}

R Documentation

Downhill Riding (DR) Procedure

Description

Downhill riding procedure for selecting optimal tuning parameters in clustering algorithms, using an (in)stability probe.

Usage

DR(X, method, minPts = 3, theta = 0.9, B = 500, lb = -30, ub = 10)

Arguments

`X`	an `n\times k` matrix where columns are `k` objects to be clustered, and each object contains n observations (objects could be a set of time series).
`method`	the clustering method to be used – currently either “TRUST” (Ciampi et al. 2010) or “DBSCAN” (Ester et al. 1996). If the method is `DBSCAN`, then set `MinPts` and optimal `\epsilon` is selected using DR. If the method is `TRUST`, then set `theta`, and optimal `\delta` is selected using DR.
`minPts`	the minimum number of samples in an `\epsilon`-neighborhood of a point to be considered as a core point. The `minPts` is to be used only with the `DBSCAN` method. The default value is 3.
`theta`	connectivity parameter `\theta \in (0,1)`, which is to be used only with the `TRUST` method. The default value is 0.9.
`B`	number of random splits in calculating the Average Cluster Deviation (ACD). The default value is 500.
`lb`, `ub`	endpoints for a range of search for the optimal parameter.

Details

Parameters lb,ub are endpoints for the search for the optimal parameter. The parameter candidates are calculated in a way such that P:= 1.1^x , x \in {lb,lb+0.5,lb+1.0,...,ub}. Although the default range of search is sufficiently wide, in some cases lb,ub can be further extended if a warning message is given.

For more discussion on properties of the considered clustering algorithms and the DR procedure see Huang et al. (2016) and Huang et al. (2018).

Value

A list containing the following components:

`P_opt`	the value of the optimal parameter. If the method is `DBSCAN`, then `P_opt` is optimal `\epsilon`. If the method is `TRUST`, then `P_opt` is optimal `\delta`.
`ACD_matrix`	a matrix that returns `ACD` for different values of a tuning parameter. If the method is `DBSCAN`, then the tuning parameter is `\epsilon`. If the method is `TRUST`, then the tuning parameter is `\delta`.

Author(s)

Xin Huang, Yulia R. Gel

References

Ciampi A, Appice A, Malerba D (2010). “Discovering trend-based clusters in spatially distributed data streams.” In International Workshop of Mining Ubiquitous and Social Environments, 107–122.

Ester M, Kriegel H, Sander J, Xu X (1996). “A density-based algorithm for discovering clusters in large spatial databases with noise.” In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), volume 96(34), 226–231.

Huang X, Iliev IR, Brenning A, Gel YR (2016). “Space-time clustering with stability probe while riding downhill.” In Proceedings of the 2nd SIGKDD Workshop on Mining and Learning from Time Series (MiLeTS).

Huang X, Iliev IR, Lyubchich V, Gel YR (2018). “Riding down the bay: space-time clustering of ecological trends.” Environmetrics, 29(5–6), e2455. doi:10.1002/env.2455.

Examples

## Not run: 
## example 1
## use iris data to test DR procedure

data(iris)  
require(clue)  # calculate NMI to compare the clustering result with the ground truth
require(scatterplot3d)

Data <- scale(iris[,-5])
ground_truth_label <- iris[,5]

# perform DR procedure to select optimal eps for DBSCAN 
# and save it in variable eps_opt
eps_opt <- DR(t(Data), method="DBSCAN", minPts = 5)$P_opt   

# apply DBSCAN with the optimal eps on iris data 
# and save the clustering result in variable res
res <- dbscan(Data, eps = eps_opt, minPts =5)$cluster  

# calculate NMI to compare the clustering result with the ground truth label
clue::cl_agreement(as.cl_partition(ground_truth_label),
                   as.cl_partition(as.numeric(res)), method = "NMI") 
# visualize the clustering result and compare it with the ground truth result
# 3D visualization of clustering result using variables Sepal.Width, Sepal.Length, 
# and Petal.Length
scatterplot3d(Data[,-4],color = res)
# 3D visualization of ground truth result using variables Sepal.Width, Sepal.Length,
# and Petal.Length
scatterplot3d(Data[,-4],color = as.numeric(ground_truth_label))


## example 2
## use synthetic time series data to test DR procedure

require(funtimes)
require(clue) 
require(zoo)

# simulate 16 time series for 4 clusters, each cluster contains 4 time series
set.seed(114) 
samp_Ind <- sample(12,replace=F)
time_points <- 30
X <- matrix(0,nrow=time_points,ncol = 12)
cluster1 <- sapply(1:4,function(x) arima.sim(list(order = c(1, 0, 0), ar = c(0.2)),
                                             n = time_points, mean = 0, sd = 1))
cluster2 <- sapply(1:4,function(x) arima.sim(list(order = c(2 ,0, 0), ar = c(0.1, -0.2)),
                                             n = time_points, mean = 2, sd = 1))
cluster3 <- sapply(1:4,function(x) arima.sim(list(order = c(1, 0, 1), ar = c(0.3), ma = c(0.1)),
                                             n = time_points, mean = 6, sd = 1))

X[,samp_Ind[1:4]] <- t(round(cluster1, 4))
X[,samp_Ind[5:8]] <- t(round(cluster2, 4))
X[,samp_Ind[9:12]] <- t(round(cluster3, 4))


# create ground truth label of the synthetic data
ground_truth_label = matrix(1, nrow = 12, ncol = 1) 
for(k in 1:3){
    ground_truth_label[samp_Ind[(4*k - 4 + 1):(4*k)]] = k
}

# perform DR procedure to select optimal delta for TRUST
# and save it in variable delta_opt
delta_opt <- DR(X, method = "TRUST")$P_opt 

# apply TRUST with the optimal delta on the synthetic data 
# and save the clustering result in variable res
res <- CSlideCluster(X, Delta = delta_opt, Theta = 0.9)  

# calculate NMI to compare the clustering result with the ground truth label
clue::cl_agreement(as.cl_partition(as.numeric(ground_truth_label)),
                   as.cl_partition(as.numeric(res)), method = "NMI")

# visualize the clustering result and compare it with the ground truth result
# visualization of the clustering result obtained by TRUST
plot.zoo(X, type = "l", plot.type = "single", col = res, xlab = "Time index", ylab = "")
# visualization of the ground truth result 
plot.zoo(X, type = "l", plot.type = "single", col = ground_truth_label,
         xlab = "Time index", ylab = "")

## End(Not run)

[Package funtimes version 9.1 Index]