Univariate Clustering {Ckmeans.1d.dp} | R Documentation |

Perform optimal univariate *k*-means or *k*-median clustering in linear (fastest), loglinear, or quadratic (slowest) time.

Ckmeans.1d.dp(x, k=c(1,9), y=1, method=c("linear", "loglinear", "quadratic"), estimate.k=c("BIC", "BIC 3.4.12")) Ckmedian.1d.dp(x, k=c(1,9), y=1, method=c("linear", "loglinear", "quadratic"), estimate.k=c("BIC", "BIC 3.4.12"))

`x` |
a numeric vector of data to be clustered. All |

`k` |
either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is |

`y` |
a value of 1 (default) to specify equal weights of 1 for each element in |

`method` |
a character string to specify the speedup method to the original cubic runtime dynamic programming. The default is |

`estimate.k` |
a character string to specify the method to estimate optimal |

`Ckmean.1d.dp`

minimizes unweighted or weighted within-cluster sum of squared distance (L2).

`Ckmedian.1d.dp`

minimizes within-cluster sum of distance (L1). Only unweighted solution is implemented and guarantees optimality.

In contrast to the heuristic `k`-means algorithms implemented in function `kmeans`

, this function optimally assigns elements in numeric vector `x`

into `k`

clusters by dynamic programming (Wang and Song 2011; Song and Zhong 2020). It minimizes the total of within-cluster sums of squared distances (`withinss`) between each element and its corresponding cluster mean. When a range is provided for `k`

, the exact number of clusters is determined by Bayesian information criterion (Song and Zhong 2020). Different from the heuristic `k`-means algorithms whose results may be non-optimal or change from run to run, the result of Ckmeans.1d.dp is guaranteed to be optimal and reproducible, and its advantage in efficiency and accuracy over heuristic `k`-means methods is most pronounced at large `k`.

The `estimate.k`

argument specifies the method to select optimal `k`

based on the Gaussian mixture model using the Bayesian information criterion (BIC). When `estimate.k="BIC"`

, it effectively deals with variance estimation for a cluster with identical values. When `estimate.k="BIC 3.4.12"`

, it uses the code in version 3.4.12 and earlier to estimate `k`

.

The `method`

argument specifies one of three options to speed up the original dynamic programming taking a runtime cubic in sample size `n`. The default `"linear"`

option, giving a total runtime of *O(n lg n + kn)* or *O(kn)* (if `x`

is already sorted in ascending order) is the fastest option but uses the most memory (still *O(kn)*) (Song and Zhong 2020); the `"loglinear"`

option, with a runtime of *O(kn lg n)*, is slightly slower but uses the least memory (Song and Zhong 2020); the slowest `"quadratic"`

option (Wang and Song 2011), with a runtime of *O(kn^2)*, is provided for the purpose of testing on small data sets.

When the sample size `n` is too large to create two *k x n* dynamic programming matrices in memory, we recommend the heuristic solutions implemented in the `kmeans`

function in package stats.

An object of class "`Ckmeans.1d.dp`

" or "`Ckmedian.1d.dp`

". It is a list containing the following components:

`cluster` |
a vector of clusters assigned to each element in |

`centers` |
a numeric vector of the (weighted) means for each cluster. |

`withinss` |
a numeric vector of the (weighted) within-cluster sum of squares for each cluster. |

`size` |
a vector of the (weighted) number of elements in each cluster. |

`totss` |
total sum of (weighted) squared distances between each element and the sample mean. This statistic is not dependent on the clustering result. |

`tot.withinss` |
total sum of (weighted) within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters. |

`betweenss` |
sum of (weighted) squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters. |

`xname` |
a character string. The actual name of the |

`yname` |
a character string. The actual name of the |

Each class has a print and a plot method, which are described along with `print.Ckmeans.1d.dp`

and `plot.Ckmeans.1d.dp`

.

Joe Song and Haizhou Wang

Song M, Zhong H (2020).
“Efficient weighted univariate clustering maps outstanding dysregulated genomic zones in human cancers.”
*Bioinformatics*.
doi: 10.1093/bioinformatics/btaa613, [Published online ahead of print, 2020 Jul 3].

Wang H, Song M (2011).
“Ckmeans.1d.dp: Optimal *k*-means clustering in one dimension by dynamic programming.”
*The R Journal*, **3**(2), 29–33.
doi: 10.32614/RJ-2011-015.

`ahist`

, `plot.Ckmeans.1d.dp`

, `print.Ckmeans.1d.dp`

in this package.

`kmeans`

in package stats that implements several heuristic *k*-means algorithms.

# Ex. 1 The number of clusters is provided. # Generate data from a Gaussian mixture model of three components x <- c(rnorm(50, sd=0.2), rnorm(50, mean=1, sd=0.3), rnorm(100, mean=-1, sd=0.25)) # Divide x into 3 clusters k <- 3 result <- Ckmedian.1d.dp(x, k) plot(result, main="Optimal univariate k-median given k") result <- Ckmeans.1d.dp(x, k) plot(result, main="Optimal univariate k-means given k") plot(x, col=result$cluster, pch=result$cluster, cex=1.5, main="Optimal univariate k-means clustering given k", sub=paste("Number of clusters given:", k)) abline(h=result$centers, col=1:k, lty="dashed", lwd=2) legend("bottomleft", paste("Cluster", 1:k), col=1:k, pch=1:k, cex=1.5, bty="n") # Ex. 2 The number of clusters is determined by Bayesian # information criterion # Generate data from a Gaussian mixture model of three components x <- c(rnorm(50, mean=-3, sd=1), rnorm(50, mean=0, sd=.5), rnorm(50, mean=3, sd=1)) # Divide x into k clusters, k automatically selected (default: 1~9) result <- Ckmedian.1d.dp(x) plot(result, main="Optimal univariate k-median with k estimated") result <- Ckmeans.1d.dp(x) plot(result, main="Optimal univariate k-means with k estimated") k <- max(result$cluster) plot(x, col=result$cluster, pch=result$cluster, cex=1.5, main="Optimal univariate k-means clustering with k estimated", sub=paste("Number of clusters is estimated to be", k)) abline(h=result$centers, col=1:k, lty="dashed", lwd=2) legend("topleft", paste("Cluster", 1:k), col=1:k, pch=1:k, cex=1.5, bty="n") # Ex. 3 Segmenting a time course using optimal weighted # univariate clustering n <- 160 t <- seq(0, 2*pi*2, length=n) n1 <- 1:(n/2) n2 <- (max(n1)+1):n y1 <- abs(sin(1.5*t[n1]) + 0.1*rnorm(length(n1))) y2 <- abs(sin(0.5*t[n2]) + 0.1*rnorm(length(n2))) y <- c(y1, y2) w <- y^8 # stress the peaks res <- Ckmeans.1d.dp(t, k=c(1:10), w) plot(res) plot(t, w, main = "Time course weighted k-means", col=res$cluster, pch=res$cluster, xlab="Time t", ylab="Transformed intensity w", type="h") abline(v=res$centers, col="chocolate", lty="dashed") text(res$centers, max(w) * .95, cex=0.5, font=2, paste(round(res$size / sum(res$size) * 100), "/ 100"))

[Package *Ckmeans.1d.dp* version 4.3.3 Index]