isa2-package {isa2} | R Documentation |
The isa package
Description
The Iterative Signature Algorithm
Introduction
The Iterative Signature Algorithm (ISA) is a biclustering algorithm. Biclustering algorithms classify simultaneously the rows and columns of an input matrix into biclusters, or as we will call them here, modules.
For the impatient
The easiest way to run ISA is to call the isa
function
with your input matrix as the single argument. This does all steps of
a typical ISA work flow, with the default parameters.
ISA biclusters
An ISA module is pair; a subset of the rows of the input matrix and a subset of its columns. In other words, a bicluster is a block of the reordered input matrix, where reordering means a permutation of both the rows and columns. (Another bicluster might be block of the same permuted input matrix or one after a different permutation.)
The criteria of a good bicluster is that 1) its rows are significantly different than the other rows, when we consider only the positions defined by the columns of the same bicluster, and (symmetrically) 2) its columns are significantly different than the other columns, when we consider only the positions defined by the rows of the same bicluster.
In other words, the rows of the bicluster are correlated, but only on the columns defined by the same bicluster; and the opposite is also true, the columns of the bicluster are correlated, but only on the rows defined by the same bicluster.
ISA biclusters are soft, two biclusters may overlap in their rows, columns or even both. It is also possible that some rows and/or columns of the input matrix are not found to be part of any ISA biclusters. Depending on the stringency parameters, it might even happen that ISA does not find any biclusters.
ISA row and column scores
ISA biclusters are not only soft, but every row and column in a given bicluster has a score, a number between minus one and one. The further this number is from zero, then stronger is the association of the given row or column to the bicluster.
How ISA works
ISA works in an iterative way. For an E (m\times n)
input matrix it starts from seed vector r_0
, which is
typically a sparse 0/1 vector of length m
. This defines a set of
rows in E
. Then E'
is multiplied by r_0
and the
result is thresholded. (Please see also ‘Normalization’ below.)
The thresholding is an important step of the ISA, without thresholding
ISA would be equivalent to a (not too effective) numerical singular
value decomposition (SVD). Currently thresholding is done by
calculating the mean and standard deviation of the vector and keeping
only elements that are further than a given number of standard
deviations from the mean. Based on the direction
parameter,
this means 1) keeping values that are significantly higher than the
mean (direction="up"
), significantly lower
(direction="down"
) or both
(direction="updown"
).
The thresholded vector c_0
is the (column)
‘signature’ of r_0
. Then the (row) signature of
c_0
is calculated, E
is multiplied by
c_0
and then thresholded to get r_1
.
This iteration is performed until it converges, i.e. r_i
and r_{i-1}
are “close”, and c_i
and
c_{i-1}
are also close. The convergence criteria,
i.e. what “close” means is by default defined by high Pearson
correlation.
It is very possible that the ISA finds the same modules more than once;
two or more seeds might converge to the same module. The function
isa.unique
eliminates every module from the result of
isa.iterate
that is very similar (in terms of
Pearson correlation) to the one that was already found before it.
Parameters
The two main parameters of ISA are the two thresholds (one for the rows and one for the columns). They basically define the stringency of the modules. If the row threshold is high, then the modules will have very similar rows. If it is mild, then modules will be bigger, with less similar rows than in the first case.
Random seeding and smart seeding
By default (i.e. if the isa
function is used) the ISA is
performed from random sparse starting seeds, generated by
generate.seeds
. This way the algorithm is
completely unsupervised, but also stochastic: it might give different
results for different runs.
It is possible to use non-random seeds as well, if you have some
knowledge about the data or are interested in a particular subset of
rows/columns, then you can feed in your seeds into the
isa.iterate
function directly. In this case the
algorithm is deterministic, for the same seed you will always get the
same results.
Normalization
On in silico data we observed that ISA has the best performance if the
input matrix is normalized (see isa.normalize
). The
normalization produces two matrices: E_r
and
E_c
. E_r
is calculated by transposing E
and
centering and scaling its rows (see scale
). E_c
is
calculated by centering and scaling the rows of E
. E_r
is
used to calculate the column signature of rows and E_c
is used
to calculate the signature of the columns.
It is possible to use another normalization, then the user is
requested to supply the normalized input data in a named list,
including the two matrices of appropriate
dimensions. ‘Er
’ will be used for calculating the
signature of the rows, ‘Ec
’ the signature of the
columns. If you want to use the same matrix in both steps, then supply
it twice, the first one transposed.
Robustness
As ISA is an unsupervised algorithm, it may very well find some modules, even if you feed in noise as an input matrix. To avoid these spurious modules we defined a robustness measure, a single number for a modules that gives how well the rows and the columns are correlated.
It recommended that the user uses isa.filter.robust
to
run ISA on the scrambled input matrix with the same threshold
parameters and then drop every module, which has a robustness score
lower than the highest robustness score among modules found in the
scrambled data.
A typical ISA work flow
Please see the manual page and the source code of isa
for a typical ISA work flow. (You can obtain the source code by typing
‘isa
’ (without the apostrophes) into your R prompt and
pressing ENTER.)
Author(s)
Gabor Csardi Gabor.Csardi@unil.ch
References
Bergmann S, Ihmels J, Barkai N: Iterative signature algorithm for the analysis of large-scale gene expression data Phys Rev E Stat Nonlin Soft Matter Phys. 2003 Mar;67(3 Pt 1):031902. Epub 2003 Mar 11.
Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network Nat Genet. 2002 Aug;31(4):370-7. Epub 2002 Jul 22
Ihmels J, Bergmann S, Barkai N: Defining transcription modules using large-scale gene expression data Bioinformatics 2004 Sep 1;20(13):1993-2003. Epub 2004 Mar 25.
See Also
The vignette in the package and isa
for running ISA.