R: Dissimilarity score

dis {bios2mds}

R Documentation

Dissimilarity score

Description

Computes the dissimilarity score between two aligned amino acid sequences, based on substitution matrices.

Usage

dis(seq1, seq2, sub.mat.id = "PAM250", gap = NULL)

Arguments

`seq1`	a character vector representing a first amino acid sequence.
`seq2`	a character vector representing a second amino acid sequence.
`sub.mat.id`	a string of characters indicating the amino acid substitution matrix to be taken into account for calculation. This should be one of "PAM40", "PAM80", "PAM120", "PAM160", "PAM250", "BLOSUM30", "BLOSUM45", "BLOSUM62", "BLOSUM80", "GONNET", "JTT", "JTT_TM" and "PHAT". Default is PAM250. See `sub.mat`.
`gap`	a numeric vector of length 2, indicating the penalty for initiating and extending a strand of two gaps (gap ignored, default).

Details

Grishin and Grishin (2002) developped a method to calculate the similarity score with amino acid substitution matrices.

Let s be an amino acid substitution matrix with elements s(a, b), let A be an alignment of n sequences, A_{ik} is a symbol (amino acid or gap: '-') in the site k of the sequence i. For each pair of sequences i and j from A, the following equations(1), (2), (3) and (4) are calculated as follows:

(1)

S_{ij}, called score per site, is obtained as:

S_{ij} = \sum_{k \in K_{ij}}s(A_{ik}, A_{jk})/l(K_{ij})

where K_{ik} is the set of sites k such that A_{ik} != '-' and A_{jk} != '-' and l(K_{ij}) is the number of elements in K_{ij}.

(2)

T_{ij}, called average upper limit of the score per site, is obtained as:

T_{ij} = 0.5 \sum_{k \in K_{ij}}(s(A_{ik}, A_{ik}) + s(A_{jk}, A_{jk}))/l(K_{ij})

(3)

S^{rand}_{ij}, called score per site expected from random sequences, is obtained as:

S^{rand}_{ij} = \sum^{20}_{a=1}\sum^{20}_{b=1}f^i_j(a)f^j_i(b)s(a, b)

where f^i_j(a) is the frequency of amino acid 'a' in i^{th} protein sequence of A over all sites in K_{ij}.

(4)

V_{ij}, called normalized score (Feng and Doolittle, 1997), is obtained as:

V_{ij} = \frac{S_{ij} - S^{rand}_{ij}}{T_{ij} - S^{rand}_{ij}}

The normalized score V_{ij} ranges from 0 (for random sequences) to 1 (for identical sequences). However, for very divergent sequences, V_{ij} can become negative due to statistical errors. In this case, dis attributes 0 to negative scores.

The dissimilarity score D_{ij} between sequences i and j is obtained from the similarity score as:

D_{ij} = V_{ij} - 1

Value

A single numeric value representing the dissimilarity score.

Author(s)

Julien Pele

References

Grishin VN and Grishin NV (2002) Euclidian space and grouping of biological objects. Bioinformatics 18:1523-1534.

Feng DF and Doolittle RF (1997) Converting amino acid alignment scores into measures of evolutionary time: a simulation study of various relationships. J Mol Evol 44:361-370.

Examples

# calculating dis between the sequences of CLTR1_HUMAN and CLTR2_HUMAN:
aln <- import.fasta(system.file("msa/human_gpcr.fa", package = "bios2mds"))
dis <- dis(aln$CLTR1_HUMAN, aln$CLTR2_HUMAN)
dis

[Package bios2mds version 1.2.3 Index]