make_frequencies {LncFinder} | R Documentation |
Make the frequencies file for new classifier construction
Description
This function calculates the frequencies of lncRNAs, CDs, and
secondary structure sequences. The frequencies file can be used to build the classifier
with function extract_features. Functions make_frequencies and extract_features
are useful when users want to build their own model.
NOTE: Function make_frequencies makes the frequencies file
for building the classifiers of the LncFinder method. If users need to calculate Logarithm-Distance,
Euclidean-Distance, and hexamer score, the frequencies file needs to be computed using function
make_referFreq.
Usage
make_frequencies(
cds.seq,
mRNA.seq,
lncRNA.seq,
SS.features = FALSE,
cds.format = "DNA",
lnc.format = "DNA",
check.cds = TRUE,
ignore.illegal = TRUE
)
Arguments
cds.seq |
Coding sequences (mRNA without UTRs). Can be a FASTA file loaded
by |
mRNA.seq |
mRNA sequences with Dot-Bracket Notation. The secondary
structure sequences can be obtained from function |
lncRNA.seq |
Long non-coding RNA sequences. Can be a FASTA file loaded by
|
SS.features |
Logical. If |
cds.format |
String. Define the format of the sequences of |
lnc.format |
String. Define the format of lncRNAs ( |
check.cds |
Logical. Incomplete CDs can lead to a false shift and
inaccurate hexamer frequencies. When |
ignore.illegal |
Logical. If |
Details
This function makes the frequencies file for the LncFinder method. This file is needed when users build their own model.
To achieve high accuracy, mRNA should not be regarded as CDs and assigned
to parameter cds.seq. However, the CDs of some species may be insufficient
for calculating frequencies. In that case, mRNAs can be supplied as CDs with parameter
check.cds = TRUE, and hexamer frequencies will be calculated on the ORF region.
Because obtaining secondary structure sequences is time-consuming,
users can provide only nucleotide sequences and build a model without secondary
structure features (SS.features = FALSE). To build a model
with secondary structure features, parameter SS.features should be set
to TRUE. In that case, the sequences of mRNA.seq and lncRNA.seq
should be secondary structure sequences (Dot-Bracket Notation).
Secondary structure sequences can be obtained with function run_RNAfold.
Please note that:
SS.features can improve performance when the species of the unevaluated sequences is identical to the species of the sequences used to build the model.
However, if users predict sequences with a model trained on another species, SS.features may lead to low accuracy.
The frequencies file consists of three groups: Hexamer Frequencies, acgu-ACGU Frequencies, and acguD Frequencies.
Hexamer Frequencies are calculated on the original nucleotide sequences using a k-mer scheme (k = 6), with a sliding window that advances 3 nt per step.
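The hexamer counting described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper count_kmers is hypothetical, the input sequence is made up, and normalizing counts by their total is an assumption about how frequencies are derived.

```r
# Hypothetical helper (not part of LncFinder): count k-mers of length k,
# moving the window `step` nucleotides at a time.
count_kmers <- function(x, k = 6, step = 3) {
  x <- tolower(x)
  n <- nchar(x)
  if (n < k) return(table(character(0)))       # sequence too short for any window
  starts <- seq(1, n - k + 1, by = step)       # window start positions: 1, 4, 7, ...
  kmers  <- substring(x, starts, starts + k - 1)
  table(kmers)
}

# Made-up 16 nt sequence, used only for illustration.
counts <- count_kmers("atggcctacagttatg", k = 6, step = 3)

# Assumption: relative frequency = count / total number of windows.
freqs <- counts / sum(counts)
print(freqs)
```

With a step of 3, the windows stay aligned to codon boundaries when the sequence starts in frame, which is why incomplete CDs can distort the result.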
For any secondary structure sequence (Dot-Bracket Notation), if a position is a dot, the corresponding nucleotide of the RNA sequence is replaced with the character "D". acguD Frequencies are the k-mer frequencies (k = 4) calculated on this new sequence.
Similarly, for any secondary structure sequence (Dot-Bracket Notation), if a position is "(" or ")", the corresponding nucleotide of the RNA sequence is replaced with its upper-case form ("A", "C", "G", "U"). acgu-ACGU Frequencies are calculated on this new sequence.
A brief example:
DNA Sequence: 5'- t a c a g t t a t g -3'
RNA Sequence: 5'- u a c a g u u a u g -3'
Dot-Bracket Sequence: 5'- . . . . ( ( ( ( ( ( -3'
acguD Sequence: { D, D, D, D, g, u, u, a, u, g }
acgu-ACGU Sequence: { u, a, c, a, G, U, U, A, U, G }
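The conversion in the brief example can be reproduced with a few lines of base R. This is a minimal sketch assuming the rules described above; the helper to_structure_alphabets is hypothetical and is not a function of the LncFinder package.

```r
# Hypothetical helper (not part of LncFinder): derive the acguD and
# acgu-ACGU sequences from a DNA sequence and its dot-bracket string.
to_structure_alphabets <- function(dna, dot_bracket) {
  rna <- chartr("t", "u", tolower(dna))        # DNA -> RNA, lower case
  nt  <- strsplit(rna, "")[[1]]
  ss  <- strsplit(dot_bracket, "")[[1]]
  stopifnot(length(nt) == length(ss))          # one structure symbol per nucleotide

  unpaired <- ss == "."
  paired   <- ss %in% c("(", ")")

  acguD    <- ifelse(unpaired, "D", nt)        # dots become "D"
  acguACGU <- ifelse(paired, toupper(nt), nt)  # paired positions become upper case

  list(acguD    = paste(acguD, collapse = ""),
       acguACGU = paste(acguACGU, collapse = ""))
}

# Same sequences as in the brief example above.
res <- to_structure_alphabets("tacagttatg", "....((((((")
print(res$acguD)     # "DDDDguuaug"
print(res$acguACGU)  # "uacaGUUAUG"
```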
Value
Returns a list that consists of the frequencies of protein-coding sequences and non-coding sequences.
References
Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang & Ying Li. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information, and physicochemical property. Briefings in Bioinformatics, 2019, 20(6):2009-2027.
Author(s)
HAN Siyu
See Also
run_RNAfold, read_SS, build_model, extract_features, make_referFreq.
Examples
### Only for examples:
data(demo_DNA.seq)
Seqs <- demo_DNA.seq
## Not run:
### Obtain the secondary structure sequences (Windows OS):
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
SS.seq <- run_RNAfold(Seqs, RNAfold.path = RNAfold.path, parallel.cores = 2)
### Make frequencies file with secondary structure features:
my_file_1 <- make_frequencies(cds.seq = SS.seq, mRNA.seq = SS.seq,
lncRNA.seq = SS.seq, SS.features = TRUE,
cds.format = "SS", lnc.format = "SS",
check.cds = TRUE, ignore.illegal = FALSE)
## End(Not run)
### Make frequencies file without secondary structure features:
my_file_2 <- make_frequencies(cds.seq = Seqs, lncRNA.seq = Seqs,
SS.features = FALSE, cds.format = "DNA",
lnc.format = "DNA", check.cds = TRUE,
ignore.illegal = FALSE)
### The input of cds.seq and lncRNA.seq can also be secondary structure
### sequences when SS.features = FALSE, for example:
data(demo_SS.seq)
SS.seq <- demo_SS.seq
my_file_3 <- make_frequencies(cds.seq = SS.seq, lncRNA.seq = Seqs,
SS.features = FALSE, cds.format = "SS",
lnc.format = "DNA", check.cds = TRUE,
ignore.illegal = FALSE)