processTimeseries {segmenTier} | R Documentation |
Process a time-series for clustering and segmentation.
Description
Prepares a time-series (time points in columns) for subsequent
clustering, and performs requested data transformations, including
a Discrete Fourier Transform (DFT) of the time-series, as direct
input for the clustering wrapper clusterTimeseries. When used for
segmentation, the row order reflects the order of the data points along which
segmentation will occur. The function can also be used as a
stand-alone function equipped especially for analysis of
oscillatory time-series, including calculation of phases and
p-values for all DFT components, and can also be used for
Fourier Analysis and subsequent clustering without segmentation.
Usage
processTimeseries(ts, na2zero = FALSE, trafo = "raw",
use.fft = FALSE, dc.trafo = "raw", dft.range, perm = 0,
use.snr = FALSE, lambda = 1, low.thresh = -Inf, smooth.space = 1,
smooth.time = 1, circular.time = FALSE, verb = 0)
Arguments
ts |
a time-series as a matrix, where columns are the time points and rows are ordered measurements, e.g., genomic positions for transcriptome data |
na2zero |
interpret NA values as 0 |
trafo |
prior data transformation, pass any function name,
e.g., "log", or the package functions "ash" (asinh:
|
use.fft |
use the Discrete Fourier Transform of the data |
dc.trafo |
data transformation for the first (DC) component of
the DFT, pass any function name, e.g., "log", or the package
functions "ash" (asinh: |
dft.range |
a vector of integers, giving the components of the Discrete Fourier Transform to be used, where 1 is the first component (DC), corresponding to the total signal (sum over all time points), and 2:n are the higher components, corresponding to 1:(n-1) full cycles in the data |
perm |
number of permutations of the data set, to obtain p-values for the oscillation |
use.snr |
use a scaled amplitude, where each component of the Discrete Fourier Transform is divided by the mean of all other components (without the first or DC component), a normalization that can be interpreted to reflect a signal-to-noise ratio (SNR) |
lambda |
parameter lambda for Box-Cox transformation of DFT amplitudes (experimental; not tested) |
low.thresh |
use this threshold to cut-off data, which will be added to the absent/nuisance cluster later |
smooth.space |
integer, if set a moving average is calculated
for each time-point between adjacent data points using stats
package's |
smooth.time |
integer, if set the time-series will be smoothed
using stats package's |
circular.time |
logical value indicating whether time can be
treated as circular in smoothing via option |
verb |
level of verbosity, 0: no output, 1: progress messages |
Details
This function exemplifies the processing of oscillatory
transcriptome time-series data, as used in the establishment of this
algorithm and the demo segment_data. As suggested by Machne & Murray
(PLoS ONE 2012) and Lehmann et al. (BMC Bioinformatics 2013), a Discrete
Fourier Transform of time-series data makes it possible to cluster
time-series by their change pattern.
Note that NA values are here interpreted as 0. Please take care of NA values yourself, if you do not want this behavior.
Rows consisting only of 0 (or NA) values, or with a total signal
(sum over all time points) below the value passed in argument
low.thresh, are detected, result in NA values in the
transformed data, and will be assigned to the
"nuisance" cluster in clusterTimeseries.
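The row filter described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the toy matrix, the threshold value and the variable names (total, flagged) are made up for the example, and whether the package compares with "<" or "<=" is not stated here.

```r
## Toy matrix: rows are ordered measurements, columns are time points.
ts <- rbind(c(5, 9, 5, 1),          # clear signal
            c(0, 0, 0, 0),          # only zeros
            c(NA, NA, NA, NA),      # only NA
            c(0.1, 0.2, 0.1, 0.0))  # low total signal

low.thresh <- 1                     # illustrative threshold (assumption)

dat <- ts
dat[is.na(dat)] <- 0                # NA values are treated as 0

total <- rowSums(dat)               # total signal per row
## flag rows that are all zero or whose total signal is below the threshold
flagged <- rowSums(dat != 0) == 0 | total < low.thresh

dat[flagged, ] <- NA                # flagged rows become NA in the processed data
which(flagged)
```

Rows 2 to 4 are flagged in this toy example; in the package, such rows end up in the "nuisance" cluster.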
Discrete Fourier Transform (DFT): if requested (option
use.fft=TRUE), a DFT will be applied using base R's
mvfft function and reporting all or only
requested (option dft.range) DFT components, where the
first, or DC ("direct current") component, equals the total signal
(sum over all points) and other components are numbered 1:n,
reflecting the number of full cycles in the time-series. Values are
reported as complex numbers, from which both amplitude and phase
can be calculated. All returned DFT components will be used by
clusterTimeseries.
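As a rough illustration of the transform described above, the following base R sketch applies mvfft to a toy matrix and derives amplitudes and phases from the complex result. The toy data and the column labels are assumptions made for the example; the package may select and label components differently.

```r
n <- 8                                    # number of time points
time <- 0:(n - 1)
## two rows: one full cycle and two full cycles over the series
ts <- rbind(10 + 3 * cos(2 * pi * 1 * time / n),
            20 + 5 * cos(2 * pi * 2 * time / n - pi / 2))

## mvfft transforms columns, so transpose to transform each row
dft <- t(stats::mvfft(t(ts)))
colnames(dft) <- c("DC", 1:(n - 1))       # illustrative labels (assumption)

Re(dft[, "DC"])      # the DC component equals rowSums(ts)
amp <- Mod(dft)      # amplitudes of all components
pha <- Arg(dft)      # phases in radians
amp[, c("1", "2")]   # row 1 peaks at 1 cycle, row 2 at 2 cycles
```

Mod and Arg work element-wise on complex matrices, so amplitude and phase matrices keep the layout of the DFT matrix.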
Additional Transformations: data can be transformed prior to DFT
(options trafo, smooth.time, smooth.space), or
after DFT (options use.snr and dc.trafo). It is
recommended to use the amplitude scaling (a signal-to-noise ratio
transformation, see option documentation). The separate
transformation of the DC component makes it possible to de-emphasize
the total signal in subsequent clustering & segmentation. Additionally, but
not tested in the context of segmentation, a Box-Cox transformation
of the DFT can be performed (option lambda). This
transformation proved useful in DFT-based clustering with the
model-based clustering algorithm in package flowClust, and is
available here for further tests with k-means clustering.
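The two post-DFT transformations can be sketched as follows. This is a minimal base R illustration of the description in the option documentation, assuming that the complex components are divided by the mean amplitude of all other non-DC components; it is not taken from the package source, and the toy data and variable names are hypothetical.

```r
n <- 8
time <- 0:(n - 1)
## same oscillation in both rows, but very different total levels
ts <- rbind(10 + 3 * cos(2 * pi * time / n),
            200 + 3 * cos(2 * pi * time / n))

dft <- t(stats::mvfft(t(ts)))   # column 1 is the DC component
amp <- Mod(dft)

## SNR-like scaling: divide each non-DC component by the mean amplitude
## of all other non-DC components
snr <- dft
for (k in 2:ncol(dft)) {
  others <- amp[, -c(1, k), drop = FALSE]
  snr[, k] <- dft[, k] / rowMeans(others)
}

## separate asinh transformation of the DC component
snr[, 1] <- asinh(Re(dft[, 1]))

Mod(snr[, 2])   # identical scaled amplitude for both rows
Re(snr[, 1])    # DC levels are compressed by asinh
```

In this toy case the raw DC components differ by a factor of 20, while the asinh-transformed values differ far less, which is the de-emphasis of total signal mentioned above.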
Phase, Amplitude and Permutation Analysis: this time-series
processing and subsequent clustering can also be used without
segmentation, e.g., for conventional microarray data or RNA-seq data
already mapped to genes. The option perm makes it possible to perform a
permutation test (perm times) and adds a matrix of empirical
p-values for all DFT components to the results object, i.e., the
fraction of the perm permutations in which the amplitude of the
randomized time-series was higher than the amplitude of the original
time-series. Phases and amplitudes can
be derived from the complex numbers in matrix "dft" of the result
object.
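A permutation test of this kind can be sketched in base R as below. The example is an assumption-laden illustration, not the package's implementation: the helper get_amp, the toy data, the number of permutations and the use of one shared permutation of time points for all rows are choices made for the example.

```r
set.seed(42)
n <- 12
time <- 0:(n - 1)
ts <- rbind(5 + 4 * cos(2 * pi * 2 * time / n) + rnorm(n, sd = 0.1), # oscillating
            5 + rnorm(n))                                            # noise only
perm <- 200                     # number of permutations (illustrative)

## hypothetical helper: amplitudes of the row-wise DFT
get_amp <- function(x) Mod(t(stats::mvfft(t(x))))
amp <- get_amp(ts)

## count how often the randomized amplitude exceeds the observed one
count <- matrix(0, nrow(ts), ncol(ts))
for (i in seq_len(perm)) {
  shuffled <- ts[, sample(ncol(ts)), drop = FALSE]  # permute time points
  count <- count + (get_amp(shuffled) > amp)
}
pval <- count / perm            # empirical p-values per row and component

## column 3 holds the component with 2 full cycles; the DC column (1) is
## invariant under permutation, so its p-values carry no information
pval[, 3]
```

The oscillating row should receive a p-value near 0 for the component with 2 full cycles, since shuffling the time points spreads its signal over all components.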
Value
Returns a list of class "timeseries" which comprises the transformed time-series and additional information, such as the total signal and the positions of rows with only NA/0 values. Note that NA values are interpreted as 0.
References
Machne & Murray (2012) <doi:10.1371/journal.pone.0037906>, and Lehmann et al. (2013) <doi:10.1186/1471-2105-14-133>
Examples
data(primseg436)
## The input data is a matrix with time points in columns
## and a 1D order, here 7624 genome positions, is reflected in rows,
## if the time-series should be segmented.
nrow(tsd)
## Time-series processing prepares the data for clustering.
## The example data is periodic, and we will cluster its Discrete Fourier
## Transform (DFT) rather than the original data. Specifically, we will
## only use components 1 to 7 of the DFT (dft.range) and also apply
## a signal/noise ratio normalization, where each component is
## divided by the mean of all other components. To de-emphasize
## total levels, the first component (DC for "direct current") of the
## DFT will be separately arcsinh transformed. This peculiar combination
## proved best for our data:
tset <- processTimeseries(ts=tsd, na2zero=TRUE, use.fft=TRUE,
dft.range=1:7, dc.trafo="ash", use.snr=TRUE)
## a plot method exists for the returned time-series class:
par(mfcol=c(2,1))
plot(tset)