R: The isa package

isa2-package {isa2}

R Documentation

The isa package

Description

The Iterative Signature Algorithm

Introduction

The Iterative Signature Algorithm (ISA) is a biclustering algorithm. Biclustering algorithms classify simultaneously the rows and columns of an input matrix into biclusters, or as we will call them here, modules.

For the impatient

The easiest way to run ISA is to call the isa function with your input matrix as the single argument. This does all steps of a typical ISA work flow, with the default parameters.

ISA biclusters

An ISA module is pair; a subset of the rows of the input matrix and a subset of its columns. In other words, a bicluster is a block of the reordered input matrix, where reordering means a permutation of both the rows and columns. (Another bicluster might be block of the same permuted input matrix or one after a different permutation.)

The criteria of a good bicluster is that 1) its rows are significantly different than the other rows, when we consider only the positions defined by the columns of the same bicluster, and (symmetrically) 2) its columns are significantly different than the other columns, when we consider only the positions defined by the rows of the same bicluster.

In other words, the rows of the bicluster are correlated, but only on the columns defined by the same bicluster; and the opposite is also true, the columns of the bicluster are correlated, but only on the rows defined by the same bicluster.

ISA biclusters are soft, two biclusters may overlap in their rows, columns or even both. It is also possible that some rows and/or columns of the input matrix are not found to be part of any ISA biclusters. Depending on the stringency parameters, it might even happen that ISA does not find any biclusters.

ISA row and column scores

ISA biclusters are not only soft, but every row and column in a given bicluster has a score, a number between minus one and one. The further this number is from zero, then stronger is the association of the given row or column to the bicluster.

How ISA works

ISA works in an iterative way. For an E (m\times n) input matrix it starts from seed vector r_0, which is typically a sparse 0/1 vector of length m. This defines a set of rows in E. Then E' is multiplied by r_0 and the result is thresholded. (Please see also ‘Normalization’ below.)

The thresholding is an important step of the ISA, without thresholding ISA would be equivalent to a (not too effective) numerical singular value decomposition (SVD). Currently thresholding is done by calculating the mean and standard deviation of the vector and keeping only elements that are further than a given number of standard deviations from the mean. Based on the direction parameter, this means 1) keeping values that are significantly higher than the mean (direction="up"), significantly lower (direction="down") or both (direction="updown").

The thresholded vector c_0 is the (column) ‘signature’ of r_0. Then the (row) signature of c_0 is calculated, E is multiplied by c_0 and then thresholded to get r_1.

This iteration is performed until it converges, i.e. r_i and r_{i-1} are “close”, and c_i and c_{i-1} are also close. The convergence criteria, i.e. what “close” means is by default defined by high Pearson correlation.

It is very possible that the ISA finds the same modules more than once; two or more seeds might converge to the same module. The function isa.unique eliminates every module from the result of isa.iterate that is very similar (in terms of Pearson correlation) to the one that was already found before it.

Parameters

The two main parameters of ISA are the two thresholds (one for the rows and one for the columns). They basically define the stringency of the modules. If the row threshold is high, then the modules will have very similar rows. If it is mild, then modules will be bigger, with less similar rows than in the first case.

Random seeding and smart seeding

By default (i.e. if the isa function is used) the ISA is performed from random sparse starting seeds, generated by generate.seeds. This way the algorithm is completely unsupervised, but also stochastic: it might give different results for different runs.

It is possible to use non-random seeds as well, if you have some knowledge about the data or are interested in a particular subset of rows/columns, then you can feed in your seeds into the isa.iterate function directly. In this case the algorithm is deterministic, for the same seed you will always get the same results.

Normalization

On in silico data we observed that ISA has the best performance if the input matrix is normalized (see isa.normalize). The normalization produces two matrices: E_r and E_c. E_r is calculated by transposing E and centering and scaling its rows (see scale). E_c is calculated by centering and scaling the rows of E. E_r is used to calculate the column signature of rows and E_c is used to calculate the signature of the columns.

It is possible to use another normalization, then the user is requested to supply the normalized input data in a named list, including the two matrices of appropriate dimensions. ‘Er’ will be used for calculating the signature of the rows, ‘Ec’ the signature of the columns. If you want to use the same matrix in both steps, then supply it twice, the first one transposed.

Robustness

As ISA is an unsupervised algorithm, it may very well find some modules, even if you feed in noise as an input matrix. To avoid these spurious modules we defined a robustness measure, a single number for a modules that gives how well the rows and the columns are correlated.

It recommended that the user uses isa.filter.robust to run ISA on the scrambled input matrix with the same threshold parameters and then drop every module, which has a robustness score lower than the highest robustness score among modules found in the scrambled data.

A typical ISA work flow

Please see the manual page and the source code of isa for a typical ISA work flow. (You can obtain the source code by typing ‘isa’ (without the apostrophes) into your R prompt and pressing ENTER.)

Author(s)

Gabor Csardi Gabor.Csardi@unil.ch

References

Bergmann S, Ihmels J, Barkai N: Iterative signature algorithm for the analysis of large-scale gene expression data Phys Rev E Stat Nonlin Soft Matter Phys. 2003 Mar;67(3 Pt 1):031902. Epub 2003 Mar 11.

Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network Nat Genet. 2002 Aug;31(4):370-7. Epub 2002 Jul 22

Ihmels J, Bergmann S, Barkai N: Defining transcription modules using large-scale gene expression data Bioinformatics 2004 Sep 1;20(13):1993-2003. Epub 2004 Mar 25.