extract_features {LncFinder} | R Documentation |
Extract the Features
Description
This function extracts features from sequences and constructs the dataset.
It only extracts features; to build new models, use function build_model.
Usage
extract_features(
  Sequences,
  label = NULL,
  SS.features = FALSE,
  format = "DNA",
  frequencies.file = "human",
  parallel.cores = 2
)
Arguments
Sequences |
mRNA sequences or long non-coding sequences. Can be a FASTA
file loaded by |
label |
Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding". |
SS.features |
Logical. If |
format |
String. Can be |
frequencies.file |
String or a list obtained from function
|
parallel.cores |
Integer. The number of cores for parallel computation.
By default the number of cores is |
Details
This function extracts the features and constructs the dataset.
Because obtaining secondary structure sequences is time-consuming,
users can build the model using only sequence and EIIP features
(SS.features = FALSE). When SS.features = TRUE, Sequences should be
secondary structure sequences (Dot-Bracket Notation) obtained from
function run_RNAfold, and parameter format should be set to "SS".
Please note that:
Secondary structure features (SS.features) can improve performance
when the species of the unevaluated sequences is identical to the species of the
sequences used to build the model.
However, if users are trying to predict sequences with a model trained on
another species, setting SS.features to TRUE may lead to low accuracy.
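The two-step secondary structure workflow described in Details could be wrapped roughly as follows. This is only a sketch under stated assumptions: the helper name extract_ss_features is hypothetical, the RNAfold program is assumed to be installed and reachable, and the run_RNAfold argument names (RNAfold.path, parallel.cores) are assumptions that should be checked against that function's own help page.

```r
# Hypothetical wrapper: fold the sequences first, then extract all features.
# Assumptions: LncFinder is installed, RNAfold is reachable at `rnafold_path`,
# and run_RNAfold() accepts `RNAfold.path` and `parallel.cores`.
extract_ss_features <- function(seqs, rnafold_path = "RNAfold", cores = 2) {
  if (!requireNamespace("LncFinder", quietly = TRUE)) {
    stop("Package 'LncFinder' is required for this sketch.")
  }
  # Step 1: obtain secondary structure sequences (Dot-Bracket Notation).
  ss_seqs <- LncFinder::run_RNAfold(seqs, RNAfold.path = rnafold_path,
                                    parallel.cores = cores)
  # Step 2: pass the folded sequences with format = "SS" and SS.features = TRUE.
  LncFinder::extract_features(ss_seqs, label = NULL, SS.features = TRUE,
                              format = "SS", frequencies.file = "human",
                              parallel.cores = cores)
}
```

Keeping the folding step separate from feature extraction makes it possible to store the folded sequences and reuse them, since folding is the slow part.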
Value
Returns a data.frame with 11 features when SS.features is FALSE,
and 19 features when SS.features is TRUE.
Features
1. Features based on sequence:
The length and coverage of the longest ORF (ORF.Max.Len and ORF.Max.Cov);
Log-Distance.lncRNA (Seq.lnc.Dist);
Log-Distance.protein-coding transcripts (Seq.pct.Dist);
Distance-Ratio.sequence (Seq.Dist.Ratio).
2. Features based on EIIP (electron-ion interaction pseudopotential) value:
Signal at 1/3 position (Signal.Peak);
Signal to noise ratio (SNR);
the minimum value of the top 10% power spectrum (Signal.Min);
the quantile Q1 and Q2 of the top 10% power spectrum (Signal.Q1 and Signal.Q2);
the maximum value of the top 10% power spectrum (Signal.Max).
3. Features based on secondary structure sequence:
Log-Distance.acguD.lncRNA (Dot_lnc.dist);
Log-Distance.acguD.protein-coding transcripts (Dot_pct.dist);
Distance-Ratio.acguD (Dot_Dist.Ratio);
Log-Distance.acgu-ACGU.lncRNA (SS.lnc.dist);
Log-Distance.acgu-ACGU.protein-coding transcripts (SS.pct.dist);
Distance-Ratio.acgu-ACGU (SS.Dist.Ratio);
Minimum free energy (MFE);
Percentage of Unpair-Pair (UP.PCT).
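To give a rough idea of what the EIIP-based features in group 2 measure, the base R sketch below maps a DNA sequence to EIIP values, takes its power spectrum, and summarises the peak at the 1/3 position together with the strongest 10% of the spectrum. It is an illustration under stated assumptions, not the package's implementation: the function name eiip_spectrum_features, the output names, the trimming of the sequence to a multiple of three, and the use of the mean of the half spectrum as the noise level are all choices made here for illustration.

```r
# Commonly cited EIIP values for DNA nucleotides (assumed here).
eiip <- c(A = 0.1260, C = 0.1340, G = 0.0806, T = 0.1335)

# Hypothetical helper: power-spectrum summaries of an EIIP-encoded sequence.
eiip_spectrum_features <- function(dna, top_fraction = 0.1) {
  bases <- strsplit(toupper(dna), "")[[1]]
  bases <- bases[bases %in% names(eiip)]        # drop ambiguous symbols
  # Trim to a multiple of 3 so the 1/3 frequency falls on an exact bin.
  n <- length(bases) - length(bases) %% 3
  stopifnot(n >= 3)
  x <- unname(eiip[bases[seq_len(n)]])
  power <- Mod(fft(x))^2
  # Skip the zero-frequency term and keep the non-redundant half.
  half <- power[2:(n %/% 2 + 1)]
  peak <- power[n / 3 + 1]                      # bin for frequency 1/3
  k <- max(1, ceiling(length(half) * top_fraction))
  top <- sort(half, decreasing = TRUE)[seq_len(k)]
  c(peak    = peak,
    snr     = peak / mean(half),                # peak relative to average power
    top_min = min(top),
    top_q1  = unname(quantile(top, 0.25)),
    top_q2  = unname(quantile(top, 0.50)),
    top_max = max(top))
}

# Example call on a toy sequence with strong period-3 structure.
eiip_spectrum_features(strrep("ATGGCC", 20))
```

A sequence with pronounced three-base periodicity, as is typical of coding regions, concentrates power at the 1/3 position, which is why the peak and the signal to noise ratio are informative.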
References
Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang & Ying Li. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information, and physicochemical property. Briefings in Bioinformatics, 2019, 20(6):2009-2027.
Author(s)
HAN Siyu
See Also
svm_tune, build_model, make_frequencies, run_RNAfold, read_SS.
Examples
## Not run:
data(demo_DNA.seq)
Seqs <- demo_DNA.seq
### Extract features with a pre-built frequencies.file:
my_features <- extract_features(Seqs, label = "Class.of.the.Sequences",
                                SS.features = FALSE, format = "DNA",
                                frequencies.file = "mouse",
                                parallel.cores = 2)
### Use your own frequencies file by assigning a frequencies list to the
### parameter "frequencies.file".
## End(Not run)