Univariate Segmentation {Ckmeans.1d.dp}R Documentation

Optimal Univariate Segmentation

Description

Perform optimal univariate k-segmentation.

Usage

Cksegs.1d.dp(y, k=c(1,9), x=seq_along(y),
             method=c("quadratic", "linear", "loglinear"),
             estimate.k=c("BIC", "BIC 3.4.12"))

Arguments

y

a numeric vector of y values. Values can be negative.

k

either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is c(1,9). When k is a range, the actual number of clusters is determined by Bayesian information criterion.

x

an optional numeric vector of data to be clustered. All NA elements must be removed from x before calling this function. The function will run faster on sorted x (in non-decreasing order) than an unsorted input.

method

a character string to specify the speedup method to the original cubic runtime dynamic programming. The default is "quadratic", which generates optimal results. The other options do not guarantee optimal solution and differ in runtime or memory usage. See Details.

estimate.k

a character string to specify the method to estimate optimal k. The default is "BIC". See Details.

Details

Cksegs.1d.dp minimizes within-cluster sum of squared distance on y. It offers optimal piece-wise constant approximation of y within clusters of x. Only method="quadratic" guarantees optimality. The "linear" and "loglinear" options are faster but not always optimal and are provided for comparison purposes.

The Bayesian information criterion (BIC) method to select optimal k is updated to deal with duplicates in the data. Otherwise, the estimated k would be the same with previous versions. Set estimate.k="BIC" to use the latest method; use estimate.k="BIC 3.4.12" to use the BIC method in version 3.4.12 or earlier to estimated k from the given range. This option is effective only when a range for k is provided.

method specifies one of three options to speed up the original dynamic programming taking a runtime cubic in sample size n. The default "quadratic" option, with a runtime of O(kn^2), guarantees optimality. The next two options do not guarantee optimality. The "linear" option, giving a total runtime of O(n \lg n + kn) or O(kn) (if x is already sorted in ascending order) is the fastest option but uses the most memory (still O(kn)); the "loglinear" option, with a runtime of O(kn \lg n), is slightly slower but uses the least memory.

Value

An object of class "Cksegs.1d.dp". It is a list containing the following components:

cluster

a vector of clusters assigned to each element in x. Each cluster is indexed by an integer from 1 to k.

centers

a numeric vector of the (weighted) means for each cluster.

withinss

a numeric vector of the (weighted) within-cluster sum of squares for each cluster.

size

a vector of the (weighted) number of elements in each cluster.

totss

total sum of (weighted) squared distances between each element and the sample mean. This statistic is not dependent on the clustering result.

tot.withinss

total sum of (weighted) within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters.

betweenss

sum of (weighted) squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters.

xname

a character string. The actual name of the x argument.

yname

a character string. The actual name of the y argument.

The class has a print and a plot method: print.Cksegs.1d.dp and plot.Cksegs.1d.dp.

Author(s)

Joe Song

See Also

plot.Cksegs.1d.dp and print.Cksegs.1d.dp.

Examples

# Ex 1. Segmenting by y

y <- c(1,1,1,2,2,2,4,4,4,4)

res <- Cksegs.1d.dp(y, k=c(1:10))

main <- "k-segs giving 3 clusters\nsucceeded in finding segments"

opar <- par(mfrow=c(1,2))

plot(res, main=main, xlab="x")

res <- Ckmeans.1d.dp(x=seq_along(y), k=c(1:10), y)
main <- "Weighted k-means giving 1 cluster\nfailed to find segments"

plot(res, main=main, xlab="x")

par(opar)

# Ex 2. Segmenting by y

y <- c(1,1,1.1,1, 2,2.5,2, 4,5,4,4)
res <- Cksegs.1d.dp(y, k=c(1:10))
plot(res, xlab="x")

# Ex 3. Segmenting a sinusoidal curve by y
x <- 1:125
y <- sin(x * .2)
res.q <- Cksegs.1d.dp(y, k=8, x=x)
plot(res.q, lwd=3, xlab="x")

# Ex 4. Segmenting by y

y <- rep(c(1,-3,4,-2), each=20)
y <- y + 0.5*rnorm(length(y))
k <- 1:10
res.q <- Cksegs.1d.dp(y, k=k, method="quadratic")
main <- paste("Cksegs (method=\"quadratic\"):\ntot.withinss =",
              format(res.q$tot.withinss, digits=4), "BIC =",
              format(res.q$BIC[length(res.q$size)], digits=4),
              "\nGUARANTEE TO BE OPTIMAL")
plot(res.q, main=main, xlab="x")
res.l <- Cksegs.1d.dp(y, k=k, method="linear")
main <- paste("Cksegs (method=\"linear\"):\ntot.withinss =",
               format(res.l$tot.withinss, digits=4), "BIC =",
              format(res.l$BIC[length(res.l$size)], digits=4),
               "\nFAST BUT MAY NOT BE OPTIMAL")
plot(res.l, main=main, xlab="x")
res.g <- Cksegs.1d.dp(y, k=k, method="loglinear")
main <- paste("Cksegs (method=\"loglinear\"):\ntot.withinss =",
              format(res.g$tot.withinss, digits=4), "BIC =",
              format(res.g$BIC[length(res.g$size)], digits=4),
              "\nFAST BUT MAY NOT BE OPTIMAL")
plot(res.g, main=main, xlab="x")

[Package Ckmeans.1d.dp version 4.3.5 Index]