POS.Feature {EncDNA} | R Documentation |
Transformation of nucleic acid sequences into numeric vectors using position-wise frequency of nucleotides.
Description
This encoding scheme was devised by Li et al. (2012). The frequencies of the four nucleotides are first computed at each position for both the positive and negative datasets, resulting in two 4*L probability tables, one per class, for sequences of length L. A 4*L statistical difference table is obtained by elementwise subtraction of the two probability tables and is then used to encode the sequences. As in sparse encoding, the nucleotides A, T, G and C are represented as (1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1) respectively. The value 1 in the sparse encoding is then replaced with the corresponding value from the difference table for the nucleotide at each position. POS feature encoding can thus be regarded as a blend of the MN-FDTF (Huang et al., 2006) and sparse encoding (Meher et al., 2016) techniques.
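The steps described above can be sketched in base R as follows. This is a minimal illustration, not the EncDNA implementation: the helper names pos_freq and encode_pos are hypothetical, sequences are plain character strings of equal length containing only A, T, G and C, and the difference table is taken as positive minus negative, which is an assumption because the direction of subtraction is not stated here.

```r
# Illustrative sketch only; not the EncDNA source code.
# Assumes equal-length character strings containing only A, T, G and C.
nucs <- c("A", "T", "G", "C")  # row order follows the sparse encoding

# 4 x L table of position-wise nucleotide frequencies for one class
pos_freq <- function(seqs) {
  chars <- do.call(rbind, strsplit(seqs, ""))  # m x L character matrix
  freq <- vapply(seq_len(ncol(chars)), function(j) {
    as.numeric(table(factor(chars[, j], levels = nucs))) / nrow(chars)
  }, numeric(4))
  rownames(freq) <- nucs
  freq
}

# Encode each test sequence as a numeric vector of length 4 * L
encode_pos <- function(positive, negative, test) {
  # Assumption: difference = positive frequencies minus negative frequencies
  diff_tab <- pos_freq(positive) - pos_freq(negative)
  L <- ncol(diff_tab)
  t(vapply(test, function(s) {
    ch <- strsplit(s, "")[[1]]
    out <- matrix(0, nrow = 4, ncol = L)
    idx <- cbind(match(ch, nucs), seq_len(L))  # (nucleotide row, position)
    out[idx] <- diff_tab[idx]  # the sparse "1" becomes the difference value
    as.vector(out)             # column-major: four values per position
  }, numeric(4 * L), USE.NAMES = FALSE))
}

# Toy data (hypothetical)
positive <- c("AT", "AT", "GT", "AC")
negative <- c("CT", "CG", "AT", "CG")
test <- c("AT", "CG")

enc <- encode_pos(positive, negative, test)
dim(enc)  # 2 8
```

Each row of enc holds one test sequence, with four consecutive values per position, of which at most one is non-zero.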
Usage
POS.Feature(positive_class, negative_class, test_seq)
Arguments
positive_class | Sequence dataset of the positive class; must be an object of class DNAStringSet.
negative_class | Sequence dataset of the negative class; must be an object of class DNAStringSet.
test_seq | Sequences to be encoded into numeric vectors; must be an object of class DNAStringSet.
Details
The DNAStringSet object can be obtained by reading sequences in FASTA format with the function readDNAStringSet, available in the Biostrings package of Bioconductor.
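As a small sketch of the step above, the following reads a FASTA file into a DNAStringSet object. The file contents and sequence names are hypothetical toy data written to a temporary file, and the Biostrings package is assumed to be installed from Bioconductor.

```r
library(Biostrings)  # Bioconductor package; assumed to be installed

# Write two toy sequences to a temporary FASTA file (hypothetical data)
fa <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "ATGCA", ">seq2", "GGCTA"), fa)

# Read the file; the result is a DNAStringSet object
seqs <- readDNAStringSet(fa, format = "fasta")
class(seqs)   # should report DNAStringSet
width(seqs)   # length of each sequence
```

An object read this way can then be supplied as positive_class, negative_class or test_seq.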
Value
A numeric matrix of order m*4n, where m is the number of sequences in test_seq and n is the length of each sequence.
Note
In this encoding procedure, dependencies of nucleotides are not taken into consideration. Both positive and negative datasets are required for encoding of nucleotide sequences. Each sequence of length L can be transformed into a numeric vector of length 4*L with this encoding technique.
Author(s)
Prabina Kumar Meher, Indian Agricultural Statistics Research Institute, New Delhi-110012, INDIA
References
Huang, J., Li, T., Chen, K. and Wu, J. (2006). An approach of encoding for prediction of splice sites using SVM. Biochimie, 88(7): 923-929.
Li, J.L., Wang, L.F., Wang, H.Y., Bai, L.Y. and Yuan, Z.M. (2012). High-accuracy splice sites prediction based on sequence component and position features. Genetics and Molecular Research, 11(3): 3432-3451.
Meher, P.K., Sahu, T.K., Rao, A.R. and Wahi, S.D. (2016). A computational approach for prediction of donor splice sites with improved accuracy. Journal of Theoretical Biology, 404: 285-294.
See Also
MN.Fdtf.Feature, Bayes.Feature, WMM.Feature
Examples
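library(EncDNA)
# Load the droso example dataset
data(droso)
positive <- droso$positive
negative <- droso$negative
test <- droso$test
# Use the first 200 sequences of each class to build the difference table
pos <- positive[1:200]
neg <- negative[1:200]
tst <- test
# Encode the test sequences; the result is a numeric matrix
enc <- POS.Feature(positive_class=pos, negative_class=neg, test_seq=tst)
enc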