R: Process a time-series for clustering and segmentation.

processTimeseries {segmenTier}

R Documentation

Process a time-series for clustering and segmentation.

Description

Prepares a time-series (time points in columns) for subsequent clustering and performs the requested data transformations, including a Discrete Fourier Transform (DFT) of the time-series, to serve as direct input for the clustering wrapper clusterTimeseries. When used for segmentation, the row order reflects the order of the data points along which segmentation will occur. The function can also be used stand-alone and is especially suited to the analysis of oscillatory time-series, including the calculation of phases and p-values for all DFT components; it can likewise be used for Fourier analysis and subsequent clustering without segmentation.

Usage

processTimeseries(ts, na2zero = FALSE, trafo = "raw",
  use.fft = FALSE, dc.trafo = "raw", dft.range, perm = 0,
  use.snr = FALSE, lambda = 1, low.thresh = -Inf, smooth.space = 1,
  smooth.time = 1, circular.time = FALSE, verb = 0)

Arguments

`ts`	a time-series as a matrix, where columns are the time points and rows are ordered measurements, e.g., genomic positions for transcriptome data
`na2zero`	interpret NA values as 0
`trafo`	prior data transformation, pass any function name, e.g., "log", or the package functions "ash" (asinh: `ash(x) = log(x + sqrt(x^2+1))`) or "log_1" (`log(ts+1)`)
`use.fft`	use the Discrete Fourier Transform of the data
`dc.trafo`	data transformation for the first (DC) component of the DFT, pass any function name, e.g., "log", or the package functions "ash" (asinh: `ash(x) = log(x + sqrt(x^2+1))`) or "log_1" (`log(x+1)`).
`dft.range`	a vector of integers, giving the components of the Discrete Fourier Transform to be used, where 1 is the first component (DC) corresponding to the total signal (sum over all time points), and 2:n are the higher components corresponding to 1:(n-1) full cycles in the data
`perm`	number of permutations of the data set, to obtain p-values for the oscillation
`use.snr`	use a scaled amplitude, where each component of the Discrete Fourier Transform is divided by the mean of all other components (without the first or DC component), a normalization that can be interpreted to reflect a signal-to-noise ratio (SNR)
`lambda`	parameter lambda for Box-Cox transformation of DFT amplitudes (experimental; not tested)
`low.thresh`	use this threshold to cut-off data, which will be added to the absent/nuisance cluster later
`smooth.space`	integer, if set a moving average is calculated for each time-point between adjacent data points using stats package's `smooth` with option `span=smooth.space`
`smooth.time`	integer, if set the time-series will be smoothed using stats package's `filter` to calculate a moving average with span `smooth.time` and `smoothEnds` to extrapolate smoothed first and last time-points (again using span `smooth.time`)
`circular.time`	logical value indicating whether time can be treated as circular in smoothing via option `smooth.time`
`verb`	level of verbosity, 0: no output, 1: progress messages
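
The formulas given for `trafo` and `dc.trafo`, and the moving average described for `smooth.time`, can be illustrated in plain R. This is a minimal sketch under stated assumptions: `ash_sketch`, `log1_sketch` and `moving_average` are hypothetical names introduced here and are not the package's own functions, and the moving average leaves NA at the ends of non-circular series instead of extrapolating them with `smoothEnds`.

```r
# asinh as written in the argument description: log(x + sqrt(x^2 + 1))
ash_sketch <- function(x) log(x + sqrt(x^2 + 1))
# log(x + 1), as described for "log_1"
log1_sketch <- function(x) log(x + 1)

x <- c(0, 1, 10, 1000)
ash_ok <- isTRUE(all.equal(ash_sketch(x), asinh(x)))    # agrees with base asinh
log1_ok <- isTRUE(all.equal(log1_sketch(x), log1p(x)))  # agrees with base log1p

# centered moving average over `span` time points using stats::filter
# (hypothetical helper; `circular` wraps the series around its ends)
moving_average <- function(x, span = 3, circular = FALSE) {
  as.numeric(stats::filter(x, rep(1 / span, span), sides = 2,
                           circular = circular))
}
ma_open <- moving_average(c(1, 2, 3, 4, 5), span = 3)
ma_circ <- moving_average(c(1, 2, 3, 4, 5), span = 3, circular = TRUE)
```

With `circular = TRUE` the first and last time points are averaged with their neighbors across the series boundary, which is only meaningful when the sampled interval covers full cycles.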

Details

This function exemplifies the processing of oscillatory transcriptome time-series data as used in the establishment of this algorithm and in the demo segment_data. As suggested by Machne & Murray (PLoS ONE 2012) and Lehmann et al. (BMC Bioinformatics 2013), a Discrete Fourier Transform of time-series data makes it possible to cluster time-series by their change pattern.

Note that NA values are here interpreted as 0. Please take care of NA values yourself, if you do not want this behavior.

Rows consisting only of 0 (or NA) values, or with a total signal (sum over all time points) below the value passed in argument low.thresh, are detected, result in NA values in the transformed data, and will be assigned to the "nuisance" cluster in clusterTimeseries.
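
The row filter described above can be sketched on a toy matrix. This is an illustration only, not the package's code: the variable names and the threshold value are assumptions made for this example.

```r
# toy matrix: rows are ordered measurements, columns are time points
ts <- rbind(c(5, 3, 8, 2),
            c(0, 0, 0, 0),
            c(NA, 0, NA, 0),
            c(0.1, 0.2, 0.1, 0.1))
low_thresh <- 1                     # hypothetical value for `low.thresh`

ts0 <- ts
ts0[is.na(ts0)] <- 0                # interpret NA as 0
total <- rowSums(ts0)               # total signal per row
all_zero <- rowSums(ts0 != 0) == 0  # rows with only 0 (or NA) values
below <- total < low_thresh         # rows below the threshold
rm_rows <- all_zero | below         # rows destined for the nuisance cluster

processed <- ts0
processed[rm_rows, ] <- NA          # flagged rows become NA in the output
```

Here only the first row survives; the others are either empty or fall below the assumed threshold.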

Discrete Fourier Transform (DFT): if requested (option use.fft=TRUE), a DFT will be applied using base R's mvfft function and reporting all or only requested (option dft.range) DFT components, where the first, or DC ("direct current") component, equals the total signal (sum over all points) and other components are numbered 1:n, reflecting the number of full cycles in the time-series. Values are reported as complex numbers, from which both amplitude and phase can be calculated. All returned DFT components will be used by clusterTimeseries.
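
The relation between DFT components, total signal, amplitude and phase can be checked with a short stand-alone sketch using `mvfft` from the stats package. The toy data and the column labels are assumptions made for this example; the package's own result object may be organized differently.

```r
n_time <- 8
t_idx <- 0:(n_time - 1)
# two toy rows with two full cycles each, differing only in phase
ts <- rbind(10 + 3 * cos(2 * pi * 2 * t_idx / n_time),
            10 + 3 * sin(2 * pi * 2 * t_idx / n_time))

# mvfft transforms columns, so transpose to transform each row
dft <- t(stats::mvfft(t(ts)))
colnames(dft) <- c("DC", seq_len(n_time - 1))  # DC, then cycles 1..7

# the DC component equals the total signal (sum over all time points)
dc_matches <- isTRUE(all.equal(Re(dft[, "DC"]), rowSums(ts)))

amplitude <- Mod(dft)  # amplitude of each component
phase <- Arg(dft)      # phase angle in radians
```

Both rows have the same amplitude at component "2" but phases that differ by a quarter cycle, which is the information a DFT-based clustering can exploit.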

Additional Transformations: data can be transformed prior to the DFT (options trafo, smooth.time, smooth.space), or after the DFT (options use.snr and dc.trafo). It is recommended to use the amplitude scaling (a signal-to-noise ratio transformation, see option documentation). The separate transformation of the DC component makes it possible to de-emphasize the total signal in subsequent clustering and segmentation. Additionally, but not tested in the context of segmentation, a Box-Cox transformation of the DFT can be performed (option lambda). This transformation proved useful in DFT-based clustering with the model-based clustering algorithm in package flowClust, and is available here for further tests with k-means clustering.
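
The amplitude scaling can be sketched as follows, reading the option documentation literally: each non-DC component is divided by the mean of all other non-DC components. This is an assumption-laden illustration on plain amplitudes; `snr_scale` is a hypothetical helper, and the package may instead scale the complex values or treat the DC column differently.

```r
# toy amplitudes: column 1 is DC, columns 2..5 are cycles 1..4
amp <- rbind(c(100, 8, 1, 1, 2),
             c(50, 2, 2, 2, 2))
colnames(amp) <- c("DC", 1:4)

# hypothetical helper: divide each non-DC amplitude by the mean
# of the remaining non-DC amplitudes in the same row
snr_scale <- function(a) {
  non_dc <- a[, -1, drop = FALSE]
  n <- ncol(non_dc)
  others_mean <- (rowSums(non_dc) - non_dc) / (n - 1)
  non_dc / others_mean
}
snr <- snr_scale(amp)

# separate transformation of the DC component, here asinh as for "ash"
dc <- asinh(amp[, "DC"])
```

A row with one dominant component (first row, cycle 1) receives a high scaled value there, while a row with a flat spectrum (second row) is scaled to 1 throughout.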

Phase, Amplitude and Permutation Analysis: this time-series processing and subsequent clustering can also be used without segmentation, e.g., for conventional microarray data or RNA-seq data already mapped to genes. The option perm performs a permutation test (perm times) and adds a matrix of empirical p-values for all DFT components to the results object, i.e., the fraction of the perm permutations in which the amplitude of the randomized time-series was higher than the observed amplitude. Phases and amplitudes can be derived from the complex numbers in matrix "dft" of the result object.
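
The idea behind the permutation test can be sketched for a single time-series. This is a generic empirical p-value calculation under stated assumptions, not the package's implementation: the toy signal, `n_perm` and the helper `amp` are introduced for this example only.

```r
set.seed(42)
n_time <- 12
t_idx <- 0:(n_time - 1)
# toy signal: one full cycle plus a little noise
x <- 5 + 2 * cos(2 * pi * t_idx / n_time) + rnorm(n_time, sd = 0.3)

amp <- function(v) Mod(stats::fft(v))  # amplitudes of all DFT components
obs <- amp(x)
n_perm <- 200                          # plays the role of `perm`

# count permutations whose amplitude is higher than the observed one
exceed <- numeric(n_time)
for (i in seq_len(n_perm)) {
  exceed <- exceed + (amp(sample(x)) > obs)
}
pvals <- exceed / n_perm               # empirical p-value per component
```

The second entry of `pvals` belongs to the component with one full cycle and should be small for this signal. The first entry (DC) is not informative, because shuffling time points does not change the total signal.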

Value

Returns a list of class "timeseries" which comprises the transformed time-series and additional information, such as the total signal and the positions of rows with only NA/0 values. Note that NA values are interpreted as 0.

References

Machne & Murray (2012) <doi:10.1371/journal.pone.0037906>, and Lehmann et al. (2013) <doi:10.1186/1471-2105-14-133>

Examples

data(primseg436)
## The input data is a matrix with time points in columns
## and a 1D order, here 7624 genome positions, is reflected in rows,
## if the time-series should be segmented.
nrow(tsd)
## Time-series processing prepares the data for clustering,
## the example data is periodic, and we will cluster its Discrete Fourier
## Transform (DFT) rather than the original data. Specifically we will
## only use components 1 to 7 of the DFT (dft.range) and also apply
## a signal/noise ratio normalization, where each component is
## divided by the mean of all other components. To de-emphasize
## total levels the first component (DC for "direct current") of the
## DFT will be separately arcsinh transformed. This peculiar combination
## proved best for our data:
tset <- processTimeseries(ts=tsd, na2zero=TRUE, use.fft=TRUE,
                          dft.range=1:7, dc.trafo="ash", use.snr=TRUE)
## a plot method exists for the returned time-series class:
par(mfcol=c(2,1))
plot(tset)

[Package segmenTier version 0.1.2 Index]