Univariate Segmentation {Ckmeans.1d.dp} | R Documentation |

Perform optimal univariate *k*-segmentation.

Cksegs.1d.dp(y, k=c(1,9), x=seq_along(y), method=c("quadratic", "linear", "loglinear"), estimate.k=c("BIC", "BIC 3.4.12"))

`y` |
a numeric vector of y values. Values can be negative. |

`k` |
either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is |

`x` |
an optional numeric vector of data to be clustered. All |

`method` |
a character string to specify the speedup method to the original cubic runtime dynamic programming. The default is |

`estimate.k` |
a character string to specify the method to estimate optimal |

`Cksegs.1d.dp`

minimizes within-cluster sum of squared distance on `y`

. It offers optimal piece-wise constant approximation of `y`

within clusters of `x`

. Only `method="quadratic"`

guarantees optimality. The "linear" and "loglinear" options are faster but not always optimal and are provided for comparison purposes.

The Bayesian information criterion (BIC) method to select optimal `k`

is updated to deal with duplicates in the data. Otherwise, the estimated k would be the same with previous versions. Set `estimate.k="BIC"`

to use the latest method; use `estimate.k="BIC 3.4.12"`

to use the BIC method in version 3.4.12 or earlier to estimated `k`

from the given range. This option is effective only when a range for `k`

is provided.

`method`

specifies one of three options to speed up the original dynamic programming taking a runtime cubic in sample size `n`. The default `"quadratic"`

option, with a runtime of *O(kn^2)*, guarantees optimality. The next two options do not guarantee optimality. The `"linear"`

option, giving a total runtime of *O(n lg n + kn)* or *O(kn)* (if `x`

is already sorted in ascending order) is the fastest option but uses the most memory (still *O(kn)*); the `"loglinear"`

option, with a runtime of *O(kn lg n)*, is slightly slower but uses the least memory.

An object of class "`Cksegs.1d.dp`

". It is a list containing the following components:

`cluster` |
a vector of clusters assigned to each element in |

`centers` |
a numeric vector of the (weighted) means for each cluster. |

`withinss` |
a numeric vector of the (weighted) within-cluster sum of squares for each cluster. |

`size` |
a vector of the (weighted) number of elements in each cluster. |

`totss` |
total sum of (weighted) squared distances between each element and the sample mean. This statistic is not dependent on the clustering result. |

`tot.withinss` |
total sum of (weighted) within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters. |

`betweenss` |
sum of (weighted) squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters. |

`xname` |
a character string. The actual name of the |

`yname` |
a character string. The actual name of the |

The class has a print and a plot method: `print.Cksegs.1d.dp`

and `plot.Cksegs.1d.dp`

.

Joe Song

`plot.Cksegs.1d.dp`

and `print.Cksegs.1d.dp`

.

# Ex 1. Segmenting by y y <- c(1,1,1,2,2,2,4,4,4,4) res <- Cksegs.1d.dp(y, k=c(1:10)) main <- "k-segs giving 3 clusters\nsucceeded in finding segments" opar <- par(mfrow=c(1,2)) plot(res, main=main, xlab="x") res <- Ckmeans.1d.dp(x=seq_along(y), k=c(1:10), y) main <- "Weighted k-means giving 1 cluster\nfailed to find segments" plot(res, main=main, xlab="x") par(opar) # Ex 2. Segmenting by y y <- c(1,1,1.1,1, 2,2.5,2, 4,5,4,4) res <- Cksegs.1d.dp(y, k=c(1:10)) plot(res, xlab="x") # Ex 3. Segmenting a sinusoidal curve by y x <- 1:125 y <- sin(x * .2) res.q <- Cksegs.1d.dp(y, k=8, x=x) plot(res.q, lwd=3, xlab="x") # Ex 4. Segmenting by y y <- rep(c(1,-3,4,-2), each=20) y <- y + 0.5*rnorm(length(y)) k <- 1:10 res.q <- Cksegs.1d.dp(y, k=k, method="quadratic") main <- paste("Cksegs (method=\"quadratic\"):\ntot.withinss =", format(res.q$tot.withinss, digits=4), "BIC =", format(res.q$BIC[length(res.q$size)], digits=4), "\nGUARANTEE TO BE OPTIMAL") plot(res.q, main=main, xlab="x") res.l <- Cksegs.1d.dp(y, k=k, method="linear") main <- paste("Cksegs (method=\"linear\"):\ntot.withinss =", format(res.l$tot.withinss, digits=4), "BIC =", format(res.l$BIC[length(res.l$size)], digits=4), "\nFAST BUT MAY NOT BE OPTIMAL") plot(res.l, main=main, xlab="x") res.g <- Cksegs.1d.dp(y, k=k, method="loglinear") main <- paste("Cksegs (method=\"loglinear\"):\ntot.withinss =", format(res.g$tot.withinss, digits=4), "BIC =", format(res.g$BIC[length(res.g$size)], digits=4), "\nFAST BUT MAY NOT BE OPTIMAL") plot(res.g, main=main, xlab="x")

[Package *Ckmeans.1d.dp* version 4.3.3 Index]