
MM1.Feature {EncDNA}

R Documentation

Transforming nucleotide sequences into numeric vectors using first order nucleotide dependency.

Description

Sequence encoding based on a first order Markov model was introduced by Rajapakse and Ho (2005) for the prediction of splice sites, and was later used extensively by Baten et al. (2006) for the same purpose. In this encoding procedure, first order dependencies between adjacent nucleotides in a sequence are taken into account. Only the positive class dataset is used to estimate the dependencies as probabilities, which are then used to encode the sequences.
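As a rough formalisation (an assumption based on the description above, not taken from the package source), let x = x_1 ... x_n be a sequence of length n, and let N_j(a, b) be the number of positive class sequences with nucleotide a at position j-1 and nucleotide b at position j. The symbols N_j and \hat{p}_j are notation introduced here for illustration only. A position-specific first order encoding could then be written as:

```latex
\hat{p}_j(b \mid a) = \frac{N_j(a, b)}{\sum_{c \in \{A, C, G, T\}} N_j(a, c)}, \qquad j = 2, \dots, n

x \mapsto \bigl( \hat{p}_2(x_2 \mid x_1),\ \hat{p}_3(x_3 \mid x_2),\ \dots,\ \hat{p}_n(x_n \mid x_{n-1}) \bigr)
```

Under this reading each sequence of length n yields n-1 values, one per pair of adjacent positions.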

Usage

MM1.Feature(positive_class, test_seq)

Arguments

`positive_class`	Sequence dataset of the positive class, must be an object of class `DNAStringSet`.
`test_seq`	Sequences to be encoded into numeric vectors, must be an object of class `DNAStringSet`.

Details

FASTA sequences should be read into R using the function readDNAStringSet from the Biostrings package. This encoding is similar to the PN.FDTF feature as far as the dependency among nucleotides in a sequence is concerned. The only difference is that it uses the positive class alone, instead of both the positive and negative classes as in PN.FDTF. The encoding also resembles the WAM features (Meher et al. 2016), in which dinucleotide dependencies are considered.
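To make the idea concrete, the following base R sketch encodes sequences with position-specific first order transition probabilities estimated from a positive class. It is only an illustration under stated assumptions: the function mm1_sketch and its arguments are hypothetical, it works on plain character strings rather than DNAStringSet objects, it assumes equal-length sequences over A, C, G and T, and it may differ in detail from what MM1.Feature actually computes.

```r
# Hypothetical helper, NOT part of EncDNA: a minimal sketch of a
# position-specific first order Markov encoding in base R.
mm1_sketch <- function(positive, test) {
  nt <- c("A", "C", "G", "T")
  # Split equal-length strings into a character matrix (one row per sequence)
  to_mat <- function(x) do.call(rbind, strsplit(x, ""))
  pos <- to_mat(positive)
  tst <- to_mat(test)
  n <- ncol(pos)
  stopifnot(n >= 2, ncol(tst) == n)

  enc <- matrix(0, nrow = nrow(tst), ncol = n - 1)
  for (j in 2:n) {
    # counts[a, b]: positive sequences with a at position j-1 and b at j
    counts <- matrix(
      as.numeric(table(factor(pos[, j - 1], levels = nt),
                       factor(pos[, j], levels = nt))),
      nrow = 4, ncol = 4, dimnames = list(nt, nt)
    )
    # Row-normalise to estimate P(b at j | a at j-1)
    prob <- counts / rowSums(counts)
    prob[is.nan(prob)] <- 0  # previous nucleotide never observed (assumed 0)
    # Look up the probability for each test sequence's adjacent pair
    enc[, j - 1] <- prob[cbind(tst[, j - 1], tst[, j])]
  }
  enc
}

# Toy data (made up for illustration)
toy_pos  <- c("ACGT", "ACGA", "AGGT")
toy_test <- c("ACGT", "AGGA")
enc_sketch <- mm1_sketch(toy_pos, toy_test)
dim(enc_sketch)
```

The result has one row per test sequence and one column per pair of adjacent positions, which matches the m*(n-1) shape described under Value.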

Value

A numeric matrix of order m*(n-1), where m is the number of sequences in test_seq and n is the length of each sequence.

Author(s)

Prabina Kumar Meher, Indian Agricultural Statistics Research Institute, New Delhi-110012, INDIA

References

Rajapakse, J. and Ho, L.S. (2005). Markov encoding for detecting signals in genomic sequences. IEEE/ACM Trans Comput Biol Bioinf., 2(2): 131-142.
Baten, A., Chang, B., Halgamuge, S. and Li, J. (2006). Splice site identification using probabilistic parameters and SVM classification. BMC Bioinformatics, 7(Suppl 5): S15.
Meher, P.K., Sahu, T.K., Rao, A.R. and Wahi, S.D. (2016). Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features. Algorithms for Molecular Biology, 11(1): 16.

Examples

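library(EncDNA)
# Load the example dataset droso shipped with the package
data(droso)
positive <- droso$positive
test <- droso$test
# Estimate dependencies from the first 200 positive class sequences
pos <- positive[1:200]
tst <- test
# Encode the test sequences; the result has one row per test sequence
enc <- MM1.Feature(positive_class=pos, test_seq=tst)
enc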

[Package EncDNA version 1.0.2 Index]