R: Run the 'segmenTier' algorithm.

segmentClusters {segmenTier}

R Documentation

Run the `segmenTier` algorithm.

Description

segmenTier's main wrapper interface; it calculates segments from a clustering sequence. This runs the segmentation algorithm once for the indicated parameters. The function segmentCluster.batch allows for multiple runs over different parameters or input clusterings.

Usage

segmentClusters(seq, k = 1, csim, E = 1, S = "ccor", M = 175,
  Mn = 20, a = -2, nui = 1, nextmax = TRUE, multi = "max",
  multib = "max", rm.nui = TRUE, save.matrix = FALSE, verb = 1)

Arguments

`seq`	Either an integer vector of cluster labels, or a structure of class 'clustering' as returned by `clusterTimeseries`. The only strict requirement for the first option is that nuisance clusters (which will be treated specially during the dynamic programming routine) have to be '0' (zero).
`k`	if argument `seq` is of class 'clustering' the kth clustering will be used; defaults to 1
`csim`	The cluster-cluster or position-cluster similarity matrix for scoring functions "ccor" and "icor" (option `S`), respectively. If `seq` is of class 'clustering' `csim` is optional and will override the similarity matrices in `seq`. If argument `seq` is a simple vector of cluster labels and the scoring function is "icor" or "ccor", an appropriate matrix `csim` MUST be provided. Finally, for scoring function "ccls" the argument `csim` will be ignored and the matrix is instead automatically constructed from argument `a`, and using argument `nui` for the nuisance cluster.
`E`	exponent to scale similarity matrices
`S`	the scoring function to be used: "ccor", "icor" or "ccls"
`M`	segment length penalty. Note that this is not a strict cut-off; it is defined as a penalty that must be "overcome" by a good score.
`Mn`	segment length penalty for nuisance cluster. Mn<M will allow shorter distances between "real" segments; only used in scoring functions "ccor" and "icor"
`a`	a cluster "dissimilarity" only used for pure cluster-based scoring w/o cluster similarity measures in scoring function "ccls".
`nui`	the similarity score to be used for nuisance clusters in the cluster similarity matrices
`nextmax`	go backwards while score is increasing before opening a new segment, default is TRUE
`multi`	handling of multiple k with max. score in forward phase, either "min" or "max" (default, see Usage)
`multib`	handling of multiple k with max. score in back-trace phase, either "min", "max" (default, see Usage) or "skip"
`rm.nui`	remove nuisance cluster segments from final results
`save.matrix`	store the total score matrix `S(i,c)` and the backtracing matrix `K(i,c)`; useful for testing, debugging or illustrating the algorithm
`verb`	level of verbosity, 0: no output, 1: progress messages

Details

This is the main R wrapper function for the ‘segmenTier’ segmentation algorithm. It takes an ordered sequence of cluster labels and returns segments of consistent clusterings, where cluster-cluster or cluster-position similarities are maximal. Its main input (argument seq) is either a "clustering" object returned by clusterTimeseries (scenario I), or an integer vector of cluster labels (scenario II). The function then runs the dynamic programming algorithm (calculateScore) for a selected scoring function and the corresponding cluster similarity matrix, followed by the back-tracing step (backtrace) to find segment borders.

The main result, list item "segments" of the returned object, is a 3-column matrix, where column 1 is the cluster assignment and columns 2 and 3 are start and end indices of the segments. For the batch function segmentCluster.batch, the "segments" item is a data.frame containing additional information, see ?segmentCluster.batch.
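The 3-column layout described above can be processed with plain matrix indexing. The following is a minimal sketch in base R that uses a made-up matrix rather than real segmenTier output; the values are hypothetical, and the assumption that both start and end indices are inclusive is ours.

```r
# Mock 3-column segment matrix: cluster, start, end (values are made up).
segs <- matrix(c(3,  10,  250,
                 1, 300,  720,
                 3, 800, 1500),
               ncol = 3, byrow = TRUE)

# Segment lengths from start (column 2) and end (column 3),
# assuming both indices are inclusive.
seg_len <- segs[, 3] - segs[, 2] + 1

# Total covered length per cluster label (column 1).
len_by_cluster <- tapply(seg_len, segs[, 1], sum)
```

Columns are addressed by position here because the description above defines them by position only.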

As shown in the publication, the parameters M, E and nui have the strongest impact on resulting segment borders. Other parameters can be fine-tuned but had little impact on our test data set.
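To illustrate why M acts as a penalty rather than a strict cut-off, the following sketch uses a toy score: the sum of per-position similarities minus M. This toy score and the function name toy_score are assumptions made for illustration only; the scoring functions actually used by segmenTier are described in the publication.

```r
# Assumed toy score: sum of per-position similarities minus the penalty M.
# This is an illustration, not the scoring function used by segmenTier.
toy_score <- function(sim, M) sum(sim) - M

M <- 175                      # default penalty from the Usage section
short_seg <- rep(1, 100)      # 100 positions, each with similarity 1
long_seg  <- rep(1, 300)      # 300 positions, each with similarity 1

score_short <- toy_score(short_seg, M)   # penalty not overcome
score_long  <- toy_score(long_seg, M)    # penalty overcome
```

Under this assumption, a short stretch can still win if its per-position similarities are high enough, which is the sense in which M is not a hard length limit.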

In the default and tested scenario I, when the input is an object of class "clustering" produced by clusterTimeseries, the cluster-cluster and cluster-position similarity matrices are already provided by this object.

In scenario II, for custom use, argument seq can be a simple clustering vector, where a nuisance cluster must be indicated by cluster label "0" (zero). The cluster-cluster or cluster-position similarities MUST be provided (argument csim) for scoring functions "ccor" and "icor", respectively. For the simplest scoring function "ccls", a uniform cluster similarity matrix is constructed from arguments a and nui, with cluster self-similarities of 1, "dissimilarities" between different clusters using argument a<0, and nuisance cluster self-similarity of -a.
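The uniform similarity matrix described for "ccls" can be sketched in base R as follows. The helper uniform_csim is hypothetical and not part of segmenTier; it follows only the description above (self-similarity 1, dissimilarity a, nuisance self-similarity -a). How argument nui enters the matrix in the package is not modelled here, and using a for nuisance-to-cluster entries is an assumption.

```r
# Hypothetical helper, not part of segmenTier: builds a uniform
# cluster-cluster similarity matrix following the description of "ccls".
uniform_csim <- function(clusters, a = -2) {
  stopifnot(a < 0)
  # Always include the nuisance cluster label 0.
  labels <- as.character(sort(unique(c(0, clusters))))
  csim <- matrix(a, nrow = length(labels), ncol = length(labels),
                 dimnames = list(labels, labels))
  diag(csim) <- 1          # self-similarity of real clusters
  csim["0", "0"] <- -a     # nuisance cluster self-similarity
  csim
}

# Simple clustering vector with nuisance positions labelled 0.
seq_labels <- c(1, 1, 0, 2, 2, 2, 0, 3)
csim <- uniform_csim(seq_labels)
```

The matrix is indexed by cluster label as a character string, so csim["1", "2"] gives the dissimilarity between clusters 1 and 2.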

The function returns a list (class "segments") comprising the main result (list item "segments"), "warnings" from the dynamic programming and backtracing phases, and the similarity matrix csim that was used, extended by the nuisance cluster. Optionally (see option save.matrix), it also contains the scoring vectors S1(i,c), the total score matrix S(i,c) and the backtracing matrix K(i,c) for analysis of algorithm performance on novel data sets. Additional convenience data, such as cluster colors and sortings, are reported if argument seq was of class 'clustering'. These allow for convenient inspection of all data processing steps with the plot methods. A plot method exists for plotting segments aligned to "timeseries" and "clustering" plots.

Value

Returns a list (class "segments") containing the main result (list item "segments"), and additional information (see ‘Details’). A plot method exists for plotting clusters aligned to time-series and segmentation plots.

References

Machne, Murray & Stadler (2017) <doi:10.1038/s41598-017-12401-8>

Examples

# load example data, an RNA-seq time series from a short genomic region
# of budding yeast
data(primseg436)

# 1) Fourier-transform time series:
## NOTE: reducing official example data set to stay within 
## CRAN example timing restrictions with segmentation below
tset <- processTimeseries(ts=tsd[2500:6500,], na2zero=TRUE, use.fft=TRUE,
                          dft.range=1:7, dc.trafo="ash", use.snr=TRUE)

# 2) cluster time-series into K=12 clusters:
cset <- clusterTimeseries(tset, K=12)

# 3) ... segment it; this takes a few seconds:
segments <- segmentClusters(seq=cset, M=100, E=2, nui=3, S="icor")

# 4) inspect results:
print(segments)
plotSegmentation(tset, cset, segments, cex=.5, lwd=3)

# 5) and get segment border table for further processing:
sgtable <- segments$segments

[Package segmenTier version 0.1.2 Index]
