R: Optimal Univariate Segmentation

Univariate Segmentation {Ckmeans.1d.dp}

R Documentation

Optimal Univariate Segmentation

Description

Perform optimal univariate k-segmentation.

Usage

Cksegs.1d.dp(y, k=c(1,9), x=seq_along(y),
             method=c("quadratic", "linear", "loglinear"),
             estimate.k=c("BIC", "BIC 3.4.12"))

Arguments

`y`	a numeric vector of y values. Values can be negative.
`k`	either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is `c(1,9)`. When `k` is a range, the actual number of clusters is determined by Bayesian information criterion.
`x`	an optional numeric vector of data to be clustered. All `NA` elements must be removed from `x` before calling this function. The function will run faster on sorted `x` (in non-decreasing order) than an unsorted input.
`method`	a character string to specify the speedup method to the original cubic runtime dynamic programming. The default is `"quadratic"`, which generates optimal results. The other options do not guarantee optimal solution and differ in runtime or memory usage. See Details.
`estimate.k`	a character string to specify the method to estimate optimal `k`. The default is `"BIC"`. See Details.

Details

Cksegs.1d.dp minimizes within-cluster sum of squared distance on y. It offers optimal piece-wise constant approximation of y within clusters of x. Only method="quadratic" guarantees optimality. The "linear" and "loglinear" options are faster but not always optimal and are provided for comparison purposes.

The Bayesian information criterion (BIC) method to select optimal k is updated to deal with duplicates in the data. Otherwise, the estimated k would be the same with previous versions. Set estimate.k="BIC" to use the latest method; use estimate.k="BIC 3.4.12" to use the BIC method in version 3.4.12 or earlier to estimated k from the given range. This option is effective only when a range for k is provided.

method specifies one of three options to speed up the original dynamic programming taking a runtime cubic in sample size n. The default "quadratic" option, with a runtime of O(kn^2), guarantees optimality. The next two options do not guarantee optimality. The "linear" option, giving a total runtime of O(n \lg n + kn) or O(kn) (if x is already sorted in ascending order) is the fastest option but uses the most memory (still O(kn)); the "loglinear" option, with a runtime of O(kn \lg n), is slightly slower but uses the least memory.

Value

An object of class "Cksegs.1d.dp". It is a list containing the following components:

`cluster`	a vector of clusters assigned to each element in `x`. Each cluster is indexed by an integer from 1 to `k`.
`centers`	a numeric vector of the (weighted) means for each cluster.
`withinss`	a numeric vector of the (weighted) within-cluster sum of squares for each cluster.
`size`	a vector of the (weighted) number of elements in each cluster.
`totss`	total sum of (weighted) squared distances between each element and the sample mean. This statistic is not dependent on the clustering result.
`tot.withinss`	total sum of (weighted) within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters.
`betweenss`	sum of (weighted) squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters.
`xname`	a character string. The actual name of the `x` argument.
`yname`	a character string. The actual name of the `y` argument.

The class has a print and a plot method: print.Cksegs.1d.dp and plot.Cksegs.1d.dp.

Author(s)

Joe Song

Examples

# Ex 1. Segmenting by y

y <- c(1,1,1,2,2,2,4,4,4,4)

res <- Cksegs.1d.dp(y, k=c(1:10))

main <- "k-segs giving 3 clusters\nsucceeded in finding segments"

opar <- par(mfrow=c(1,2))

plot(res, main=main, xlab="x")

res <- Ckmeans.1d.dp(x=seq_along(y), k=c(1:10), y)
main <- "Weighted k-means giving 1 cluster\nfailed to find segments"

plot(res, main=main, xlab="x")

par(opar)

# Ex 2. Segmenting by y

y <- c(1,1,1.1,1, 2,2.5,2, 4,5,4,4)
res <- Cksegs.1d.dp(y, k=c(1:10))
plot(res, xlab="x")

# Ex 3. Segmenting a sinusoidal curve by y
x <- 1:125
y <- sin(x * .2)
res.q <- Cksegs.1d.dp(y, k=8, x=x)
plot(res.q, lwd=3, xlab="x")

# Ex 4. Segmenting by y

y <- rep(c(1,-3,4,-2), each=20)
y <- y + 0.5*rnorm(length(y))
k <- 1:10
res.q <- Cksegs.1d.dp(y, k=k, method="quadratic")
main <- paste("Cksegs (method=\"quadratic\"):\ntot.withinss =",
              format(res.q$tot.withinss, digits=4), "BIC =",
              format(res.q$BIC[length(res.q$size)], digits=4),
              "\nGUARANTEE TO BE OPTIMAL")
plot(res.q, main=main, xlab="x")
res.l <- Cksegs.1d.dp(y, k=k, method="linear")
main <- paste("Cksegs (method=\"linear\"):\ntot.withinss =",
               format(res.l$tot.withinss, digits=4), "BIC =",
              format(res.l$BIC[length(res.l$size)], digits=4),
               "\nFAST BUT MAY NOT BE OPTIMAL")
plot(res.l, main=main, xlab="x")
res.g <- Cksegs.1d.dp(y, k=k, method="loglinear")
main <- paste("Cksegs (method=\"loglinear\"):\ntot.withinss =",
              format(res.g$tot.withinss, digits=4), "BIC =",
              format(res.g$BIC[length(res.g$size)], digits=4),
              "\nFAST BUT MAY NOT BE OPTIMAL")
plot(res.g, main=main, xlab="x")

[Package Ckmeans.1d.dp version 4.3.5 Index]