R: Transformation of nucleic acid sequences into numeric vectors...

POS.Feature {EncDNA}

R Documentation

Transformation of nucleic acid sequences into numeric vectors using position-wise frequency of nucleotides.

Description

This encoding scheme was devised by Li et al. (2012). The frequencies of the four nucleotides are first computed at each position for both the positive and negative datasets, resulting in two 4*L probability tables, one per class, for sequences of length L. A 4*L statistical difference table is obtained by elementwise subtraction of the two probability tables and is then used to encode the sequences. Under sparse encoding, the nucleotides A, T, G and C are encoded as (1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1) respectively. In the POS encoding, the value 1 of the sparse encoding is replaced with the corresponding value from the difference table for the nucleotide at each position. POS feature encoding is thus a blend of the MN-FDTF (Huang et al., 2006) and sparse encoding (Meher et al., 2016) techniques.
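The steps above can be sketched in base R on plain character strings. This is only an illustration of the idea, not the package's implementation: `position_freq` and `pos_encode_sketch` are hypothetical helpers, the toy sequences are made up, and both the direction of the subtraction (positive minus negative) and the position-by-position layout of the output columns are assumptions.

```r
# Illustrative sketch only; these helpers are hypothetical and not part of EncDNA.
nucleotides <- c("A", "T", "G", "C")  # order used by the sparse encoding

# Build a 4 x L table of position-wise nucleotide frequencies
# from equal-length sequences given as character strings.
position_freq <- function(seqs) {
  chars <- do.call(rbind, strsplit(seqs, ""))  # one row per sequence
  freq <- matrix(0, nrow = 4, ncol = ncol(chars),
                 dimnames = list(nucleotides, NULL))
  for (nt in nucleotides) {
    freq[nt, ] <- colMeans(chars == nt)
  }
  freq
}

# Encode each test sequence as a vector of length 4 * L.
pos_encode_sketch <- function(positive, negative, test) {
  # Assumption: difference = positive frequencies minus negative frequencies
  diff_table <- position_freq(positive) - position_freq(negative)
  test_chars <- do.call(rbind, strsplit(test, ""))
  L <- ncol(test_chars)
  encoded <- matrix(0, nrow = length(test), ncol = 4 * L)
  for (i in seq_along(test)) {
    for (j in seq_len(L)) {
      k <- match(test_chars[i, j], nucleotides)
      # The sparse "1" for nucleotide k at position j is replaced
      # by the corresponding difference value.
      encoded[i, 4 * (j - 1) + k] <- diff_table[k, j]
    }
  }
  encoded
}

# Toy data, made up for illustration
positive <- c("ATGC", "ATGA", "TTGC")
negative <- c("CAGT", "ATCA", "GGTC")
encoded <- pos_encode_sketch(positive, negative, test = c("ATGC", "CAGT"))
dim(encoded)  # 2 x 16
```

Each block of four columns corresponds to one position, and at most one entry per block is non-zero, which mirrors the sparse encoding with its 1 replaced by a difference value.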

Usage

POS.Feature(positive_class, negative_class, test_seq)

Arguments

`positive_class`	Sequence dataset of the positive class, must be an object of class `DNAStringSet`.
`negative_class`	Sequence dataset of the negative class, must be an object of class `DNAStringSet`.
`test_seq`	Sequences to be encoded into numeric vectors, must be an object of class `DNAStringSet`.

Details

The `DNAStringSet` object can be obtained by reading sequences in FASTA format with the function `readDNAStringSet`, available in the Biostrings package of Bioconductor.
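As a minimal sketch of that step, the snippet below writes a small FASTA file to a temporary location and reads it back with `readDNAStringSet`. The file and its two sequences are made up for illustration, and the call is guarded so that the snippet still runs when Biostrings is not installed.

```r
# Write a tiny, made-up FASTA file to a temporary location
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "ATGCATGC", ">seq2", "TTGACCGA"), fasta_path)

# Read it back as a DNAStringSet if Biostrings is available
if (requireNamespace("Biostrings", quietly = TRUE)) {
  seqs <- Biostrings::readDNAStringSet(fasta_path, format = "fasta")
  print(class(seqs))   # class of the returned object
  print(length(seqs))  # number of sequences read
}
```

An object read this way can then be passed as `positive_class`, `negative_class` or `test_seq`.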

Value

A numeric matrix of order m*4n, where m is the number of sequences in `test_seq` and n is the length of each sequence.

Note

In this encoding procedure, dependencies of nucleotides are not taken into consideration. Both positive and negative datasets are required for encoding of nucleotide sequences. Each sequence of length L can be transformed into a numeric vector of length 4*L with this encoding technique.

Author(s)

Prabina Kumar Meher, Indian Agricultural Statistics Research Institute, New Delhi-110012, INDIA

References

Huang, J., Li, T., Chen, K. and Wu, J. (2006). An approach of encoding for prediction of splice sites using SVM. Biochimie, 88(7): 923-929.
Li, J.L., Wang, L.F., Wang, H.Y., Bai, L.Y. and Yuan, Z.M. (2012). High-accuracy splice sites prediction based on sequence component and position features. Genetics and Molecular Research, 11(3): 3432-3451.
Meher, P.K., Sahu, T.K., Rao, A.R. and Wahi, S.D. (2016). A computational approach for prediction of donor splice sites with improved accuracy. Journal of Theoretical Biology, 404: 285-294.

Examples

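library(EncDNA)

# Load the droso example dataset and extract its three components
data(droso)
positive <- droso$positive
negative <- droso$negative
test <- droso$test

# Use the first 200 sequences of each class to build the difference table
pos <- positive[1:200]
neg <- negative[1:200]
tst <- test

# Encode the test sequences; the result is an m*4n numeric matrix
enc <- POS.Feature(positive_class=pos, negative_class=neg, test_seq=tst)
enc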

[Package EncDNA version 1.0.2 Index]