Univariate Segmentation {Ckmeans.1d.dp} | R Documentation |
Optimal Univariate Segmentation
Description
Perform optimal univariate k-segmentation.
Usage
Cksegs.1d.dp(y, k=c(1,9), x=seq_along(y),
             method=c("quadratic", "linear", "loglinear"),
             estimate.k=c("BIC", "BIC 3.4.12"))
Arguments
y |
a numeric vector of y values. Values can be negative. |
k |
either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is c(1,9). |
x |
an optional numeric vector of data to be clustered. All |
method |
a character string specifying the speedup method applied to the original cubic-runtime dynamic programming. The default is "quadratic". |
estimate.k |
a character string specifying the method used to estimate the optimal k. |
Details
Cksegs.1d.dp minimizes the within-cluster sum of squared distances on y. It offers an optimal piecewise-constant approximation of y within clusters of x. Only method="quadratic" guarantees optimality. The "linear" and "loglinear" options are faster but not always optimal and are provided for comparison purposes.
The Bayesian information criterion (BIC) method for selecting the optimal k has been updated to deal with duplicates in the data. Otherwise, the estimated k is the same as in previous versions. Set estimate.k="BIC" to use the latest method; use estimate.k="BIC 3.4.12" to use the BIC method of version 3.4.12 or earlier to estimate k from the given range. This option is effective only when a range for k is provided.
The method argument specifies one of three options to speed up the original dynamic programming algorithm, whose runtime is cubic in the sample size n. The default "quadratic" option, with a runtime of O(kn^2), guarantees optimality. The next two options do not guarantee optimality. The "linear" option, with a total runtime of O(n lg n + kn), or O(kn) if x is already sorted in ascending order, is the fastest option but uses the most memory (still O(kn)). The "loglinear" option, with a runtime of O(kn lg n), is slightly slower but uses the least memory.
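The objective described in this section can be illustrated with a plain dynamic program in base R. This is only a minimal sketch, not the package's implementation: segment_dp is a hypothetical helper, and it assumes y is already ordered by x, ignores weights, and takes a single fixed k. Using prefix sums to evaluate each segment cost in constant time gives a runtime of O(kn^2).

```r
# Hypothetical helper (not part of Ckmeans.1d.dp): exact piecewise-constant
# segmentation of y, taken in its given order, into k contiguous segments.
segment_dp <- function(y, k) {
  n <- length(y)
  stopifnot(k >= 1, k <= n)

  # Prefix sums let us get the sum of squares of y[i..j] in O(1).
  s1 <- c(0, cumsum(y))
  s2 <- c(0, cumsum(y^2))
  ssq <- function(i, j) {
    (s2[j + 1] - s2[i]) - (s1[j + 1] - s1[i])^2 / (j - i + 1)
  }

  # D[q, j]: minimum total within-segment sum of squares for y[1..j]
  #          split into q segments.
  # B[q, j]: start index of the last segment in that optimal split.
  D <- matrix(Inf, nrow = k, ncol = n)
  B <- matrix(1L, nrow = k, ncol = n)
  for (j in 1:n) D[1, j] <- ssq(1, j)
  if (k >= 2) {
    for (q in 2:k) {
      for (j in q:n) {
        for (i in q:j) {  # candidate last segment y[i..j]
          cost <- D[q - 1, i - 1] + ssq(i, j)
          if (cost < D[q, j]) {
            D[q, j] <- cost
            B[q, j] <- i
          }
        }
      }
    }
  }

  # Backtrack from the last segment to the first to label each element.
  cluster <- integer(n)
  j <- n
  for (q in k:1) {
    i <- B[q, j]
    cluster[i:j] <- q
    j <- i - 1
  }
  list(cluster = cluster, tot.withinss = D[k, n])
}

res <- segment_dp(c(1, 1, 1, 2, 2, 2, 4, 4, 4, 4), k = 3)
res$cluster
res$tot.withinss
```

Because every split point is examined for every prefix, the sketch returns a minimizer of the total within-segment sum of squares for the given k; it does not attempt the "linear" or "loglinear" speedups.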
Value
An object of class "Cksegs.1d.dp". It is a list containing the following components:
cluster |
a vector of clusters assigned to each element in |
centers |
a numeric vector of the (weighted) means for each cluster. |
withinss |
a numeric vector of the (weighted) within-cluster sum of squares for each cluster. |
size |
a vector of the (weighted) number of elements in each cluster. |
totss |
total sum of (weighted) squared distances between each element and the sample mean. This statistic is not dependent on the clustering result. |
tot.withinss |
total sum of (weighted) within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters. |
betweenss |
sum of (weighted) squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters. |
xname |
a character string. The actual name of the x argument. |
yname |
a character string. The actual name of the y argument. |
The class has a print method and a plot method: print.Cksegs.1d.dp and plot.Cksegs.1d.dp.
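How the sum-of-squares components relate to one another can be checked in base R. The sketch below is an illustration only: it assumes the unweighted case, uses a hand-written cluster vector rather than output from Cksegs.1d.dp, and assumes the usual k-means decomposition in which each cluster's squared distance to the sample mean is counted once per element.

```r
y <- c(1, 1, 1.1, 1, 2, 2.5, 2, 4, 5, 4, 4)
# Assumed segmentation, written by hand for illustration.
cluster <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)

centers  <- tapply(y, cluster, mean)     # mean of y in each cluster
size     <- tapply(y, cluster, length)   # number of elements in each cluster
withinss <- tapply(y, cluster, function(v) sum((v - mean(v))^2))

totss        <- sum((y - mean(y))^2)     # does not depend on the clustering
tot.withinss <- sum(withinss)
betweenss    <- sum(size * (centers - mean(y))^2)

# The total sum of squares splits into within- and between-cluster parts.
stopifnot(isTRUE(all.equal(totss, tot.withinss + betweenss)))
```

Since totss is fixed by the data, minimizing tot.withinss for a given number of clusters is equivalent to maximizing betweenss under this decomposition.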
Author(s)
Joe Song
See Also
plot.Cksegs.1d.dp and print.Cksegs.1d.dp.
Examples
# Ex 1. Segmenting by y
y <- c(1,1,1,2,2,2,4,4,4,4)
res <- Cksegs.1d.dp(y, k=c(1:10))
main <- "k-segs giving 3 clusters\nsucceeded in finding segments"
opar <- par(mfrow=c(1,2))
plot(res, main=main, xlab="x")
res <- Ckmeans.1d.dp(x=seq_along(y), k=c(1:10), y)
main <- "Weighted k-means giving 1 cluster\nfailed to find segments"
plot(res, main=main, xlab="x")
par(opar)
# Ex 2. Segmenting by y
y <- c(1,1,1.1,1, 2,2.5,2, 4,5,4,4)
res <- Cksegs.1d.dp(y, k=c(1:10))
plot(res, xlab="x")
# Ex 3. Segmenting a sinusoidal curve by y
x <- 1:125
y <- sin(x * .2)
res.q <- Cksegs.1d.dp(y, k=8, x=x)
plot(res.q, lwd=3, xlab="x")
# Ex 4. Segmenting by y
y <- rep(c(1,-3,4,-2), each=20)
y <- y + 0.5*rnorm(length(y))
k <- 1:10
res.q <- Cksegs.1d.dp(y, k=k, method="quadratic")
main <- paste("Cksegs (method=\"quadratic\"):\ntot.withinss =",
              format(res.q$tot.withinss, digits=4), "BIC =",
              format(res.q$BIC[length(res.q$size)], digits=4),
              "\nGUARANTEED TO BE OPTIMAL")
plot(res.q, main=main, xlab="x")
res.l <- Cksegs.1d.dp(y, k=k, method="linear")
main <- paste("Cksegs (method=\"linear\"):\ntot.withinss =",
              format(res.l$tot.withinss, digits=4), "BIC =",
              format(res.l$BIC[length(res.l$size)], digits=4),
              "\nFAST BUT MAY NOT BE OPTIMAL")
plot(res.l, main=main, xlab="x")
res.g <- Cksegs.1d.dp(y, k=k, method="loglinear")
main <- paste("Cksegs (method=\"loglinear\"):\ntot.withinss =",
              format(res.g$tot.withinss, digits=4), "BIC =",
              format(res.g$BIC[length(res.g$size)], digits=4),
              "\nFAST BUT MAY NOT BE OPTIMAL")
plot(res.g, main=main, xlab="x")