Univariate Clustering {Ckmeans.1d.dp} | R Documentation |
Optimal (Weighted) Univariate Clustering
Description
Perform optimal univariate k
-means or k
-median clustering in linear (fastest), loglinear, or quadratic (slowest) time.
Usage
Ckmeans.1d.dp(x, k=c(1,9), y=1,
method=c("linear", "loglinear", "quadratic"),
estimate.k=c("BIC", "BIC 3.4.12"))
Ckmedian.1d.dp(x, k=c(1,9), y=1,
method=c("linear", "loglinear", "quadratic"),
estimate.k=c("BIC", "BIC 3.4.12"))
Arguments
x |
a numeric vector of data to be clustered. All |
k |
either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is |
y |
a value of 1 (default) to specify equal weights of 1 for each element in |
method |
a character string to specify the speedup method to the original cubic runtime dynamic programming. The default is |
estimate.k |
a character string to specify the method to estimate optimal |
Details
Ckmean.1d.dp
minimizes unweighted or weighted within-cluster sum of squared distance (L2).
Ckmedian.1d.dp
minimizes within-cluster sum of distance (L1). Only unweighted solution is implemented and guarantees optimality.
In contrast to the heuristic k-means algorithms implemented in function kmeans
, this function optimally assigns elements in numeric vector x
into k
clusters by dynamic programming (Wang and Song 2011; Song and Zhong 2020). It minimizes the total of within-cluster sums of squared distances (withinss) between each element and its corresponding cluster mean. When a range is provided for k
, the exact number of clusters is determined by Bayesian information criterion (Song and Zhong 2020). Different from the heuristic k-means algorithms whose results may be non-optimal or change from run to run, the result of Ckmeans.1d.dp is guaranteed to be optimal and reproducible, and its advantage in efficiency and accuracy over heuristic k-means methods is most pronounced at large k.
The estimate.k
argument specifies the method to select optimal k
based on the Gaussian mixture model using the Bayesian information criterion (BIC). When estimate.k="BIC"
, it effectively deals with variance estimation for a cluster with identical values. When estimate.k="BIC 3.4.12"
, it uses the code in version 3.4.12 and earlier to estimate k
.
The method
argument specifies one of three options to speed up the original dynamic programming taking a runtime cubic in sample size n. The default "linear"
option, giving a total runtime of O(n \lg n + kn)
or O(kn)
(if x
is already sorted in ascending order) is the fastest option but uses the most memory (still O(kn)
) (Song and Zhong 2020); the "loglinear"
option, with a runtime of O(kn \lg n)
, is slightly slower but uses the least memory (Song and Zhong 2020); the slowest "quadratic"
option (Wang and Song 2011), with a runtime of O(kn^2)
, is provided for the purpose of testing on small data sets.
When the sample size n is too large to create two k \times n
dynamic programming matrices in memory, we recommend the heuristic solutions implemented in the kmeans
function in package stats.
Value
An object of class "Ckmeans.1d.dp
" or "Ckmedian.1d.dp
". It is a list containing the following components:
cluster |
a vector of clusters assigned to each element in |
centers |
a numeric vector of the (weighted) means for each cluster. |
withinss |
a numeric vector of the (weighted) within-cluster sum of squares for each cluster. |
size |
a vector of the (weighted) number of elements in each cluster. |
totss |
total sum of (weighted) squared distances between each element and the sample mean. This statistic is not dependent on the clustering result. |
tot.withinss |
total sum of (weighted) within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters. |
betweenss |
sum of (weighted) squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters. |
xname |
a character string. The actual name of the |
yname |
a character string. The actual name of the |
Each class has a print and a plot method, which are described along with print.Ckmeans.1d.dp
and plot.Ckmeans.1d.dp
.
Author(s)
Joe Song and Haizhou Wang
References
Song M, Zhong H (2020).
“Efficient weighted univariate clustering maps outstanding dysregulated genomic zones in human cancers.”
Bioinformatics, 36(20), 5027–5036.
doi:10.1093/bioinformatics/btaa613.
Wang H, Song M (2011).
“Ckmeans.1d.dp: Optimal k
-means clustering in one dimension by dynamic programming.”
The R Journal, 3(2), 29–33.
doi:10.32614/RJ-2011-015.
See Also
ahist
, plot.Ckmeans.1d.dp
, print.Ckmeans.1d.dp
in this package.
kmeans
in package stats that implements several heuristic k
-means algorithms.
Examples
# Ex. 1 The number of clusters is provided.
# Generate data from a Gaussian mixture model of three components
x <- c(rnorm(50, sd=0.2), rnorm(50, mean=1, sd=0.3), rnorm(100,
mean=-1, sd=0.25))
# Divide x into 3 clusters
k <- 3
result <- Ckmedian.1d.dp(x, k)
plot(result, main="Optimal univariate k-median given k")
result <- Ckmeans.1d.dp(x, k)
plot(result, main="Optimal univariate k-means given k")
plot(x, col=result$cluster, pch=result$cluster, cex=1.5,
main="Optimal univariate k-means clustering given k",
sub=paste("Number of clusters given:", k))
abline(h=result$centers, col=1:k, lty="dashed", lwd=2)
legend("bottomleft", paste("Cluster", 1:k), col=1:k, pch=1:k,
cex=1.5, bty="n")
# Ex. 2 The number of clusters is determined by Bayesian
# information criterion
# Generate data from a Gaussian mixture model of three components
x <- c(rnorm(50, mean=-3, sd=1), rnorm(50, mean=0, sd=.5),
rnorm(50, mean=3, sd=1))
# Divide x into k clusters, k automatically selected (default: 1~9)
result <- Ckmedian.1d.dp(x)
plot(result, main="Optimal univariate k-median with k estimated")
result <- Ckmeans.1d.dp(x)
plot(result, main="Optimal univariate k-means with k estimated")
k <- max(result$cluster)
plot(x, col=result$cluster, pch=result$cluster, cex=1.5,
main="Optimal univariate k-means clustering with k estimated",
sub=paste("Number of clusters is estimated to be", k))
abline(h=result$centers, col=1:k, lty="dashed", lwd=2)
legend("topleft", paste("Cluster", 1:k), col=1:k, pch=1:k,
cex=1.5, bty="n")
# Ex. 3 Segmenting a time course using optimal weighted
# univariate clustering
n <- 160
t <- seq(0, 2*pi*2, length=n)
n1 <- 1:(n/2)
n2 <- (max(n1)+1):n
y1 <- abs(sin(1.5*t[n1]) + 0.1*rnorm(length(n1)))
y2 <- abs(sin(0.5*t[n2]) + 0.1*rnorm(length(n2)))
y <- c(y1, y2)
w <- y^8 # stress the peaks
res <- Ckmeans.1d.dp(t, k=c(1:10), w)
plot(res)
plot(t, w, main = "Time course weighted k-means",
col=res$cluster, pch=res$cluster,
xlab="Time t", ylab="Transformed intensity w",
type="h")
abline(v=res$centers, col="chocolate", lty="dashed")
text(res$centers, max(w) * .95, cex=0.5, font=2,
paste(round(res$size / sum(res$size) * 100), "/ 100"))