extract_features {LncFinder}

R Documentation

Extract the Features

Description

This function extracts features from sequences and assembles them into a dataset. It only extracts features; to build a new model, use function build_model.

Usage

extract_features(
  Sequences,
  label = NULL,
  SS.features = FALSE,
  format = "DNA",
  frequencies.file = "human",
  parallel.cores = 2
)

Arguments

`Sequences`	mRNA sequences or long non-coding sequences. Can be a FASTA file loaded by `seqinr-package` or secondary structure sequences (Dot-Bracket Notation) obtained from function `run_RNAfold`. If `Sequences` contains secondary structure sequences, parameter `format` should be set to `"SS"`.
`label`	Optional. String. Indicates the label of the sequences, such as `"NonCoding"` or `"Coding"`.
`SS.features`	Logical. If `SS.features = TRUE`, secondary structure features will be extracted. In this case, `Sequences` should be secondary structure sequences (Dot-Bracket Notation) obtained from function `run_RNAfold` and parameter `format` should be set as `"SS"`.
`format`	String. Can be `"DNA"` or `"SS"`. Defines the format of `Sequences`: `"DNA"` for DNA sequences and `"SS"` for secondary structure sequences. This parameter must be set to `"SS"` when `SS.features = TRUE`.
`frequencies.file`	String or a list obtained from function `make_frequencies`. Pass the species name `"human"`, `"mouse"` or `"wheat"` to use the pre-built frequencies files, or supply your own frequencies file (see function `make_frequencies`).
`parallel.cores`	Integer. The number of cores for parallel computation. The default is `2`. Set to `-1` to run this function with all available cores.

Details

This function extracts the features and constructs the dataset.

Because obtaining secondary structure sequences is time consuming, users can build the model with only the sequence and EIIP features (SS.features = FALSE). When SS.features = TRUE, Sequences should be secondary structure sequences (Dot-Bracket Notation) obtained from function run_RNAfold, and parameter format should be set to "SS".

Please note that:

Secondary structure features (SS.features) can improve performance when the unevaluated sequences come from the same species as the sequences used to build the model.

However, if users are predicting sequences with a model trained on another species, setting SS.features = TRUE may lower accuracy.
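The constraint between SS.features and format described above can be checked before calling extract_features. The following is a minimal sketch in base R; the helper name check_feature_args is hypothetical and not part of LncFinder.

```r
# Hypothetical helper (not part of LncFinder): validates the documented
# combination of SS.features and format before feature extraction.
check_feature_args <- function(SS.features = FALSE, format = "DNA") {
  # Only "DNA" and "SS" are documented values for format.
  format <- match.arg(format, c("DNA", "SS"))
  # Secondary structure features need secondary structure input.
  if (isTRUE(SS.features) && format != "SS") {
    stop('SS.features = TRUE requires format = "SS"')
  }
  invisible(TRUE)
}

check_feature_args(SS.features = FALSE, format = "DNA")  # passes
check_feature_args(SS.features = TRUE,  format = "SS")   # passes
```

Calling check_feature_args(SS.features = TRUE, format = "DNA") raises an error, which mirrors the requirement stated in the Arguments section.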

Value

Returns a data.frame with 11 features when SS.features is FALSE and 19 features when SS.features is TRUE.

Features

1. Features based on sequence:

The length and coverage of the longest ORF (ORF.Max.Len and ORF.Max.Cov);

Log-Distance.lncRNA (Seq.lnc.Dist);

Log-Distance.protein-coding transcripts (Seq.pct.Dist);

Distance-Ratio.sequence (Seq.Dist.Ratio).
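To illustrate what the ORF length and coverage features measure, here is a minimal base R sketch. It assumes one common ORF definition (an in-frame ATG through the next stop codon, forward strand only, stop codon counted in the length); LncFinder's own ORF rules may differ, and the function name longest_orf is hypothetical.

```r
# Hypothetical helper (not LncFinder's implementation): longest ORF on the
# forward strand, scanning the three reading frames.
longest_orf <- function(dna) {
  dna <- toupper(dna)
  n <- nchar(dna)
  stops <- c("TAA", "TAG", "TGA")
  best <- 0L
  for (frame in 0:2) {
    if (n - frame < 3) next                 # no complete codon in this frame
    starts <- seq(frame + 1L, n - 2L, by = 3L)
    codons <- substring(dna, starts, starts + 2L)
    open_at <- NA_integer_                  # codon index of the current ATG
    for (i in seq_along(codons)) {
      if (is.na(open_at) && codons[i] == "ATG") {
        open_at <- i
      } else if (!is.na(open_at) && codons[i] %in% stops) {
        # Assumption: the ORF length includes the stop codon.
        best <- max(best, (i - open_at + 1L) * 3L)
        open_at <- NA_integer_
      }
    }
  }
  # Coverage is the ORF length divided by the full sequence length.
  list(ORF.Max.Len = best, ORF.Max.Cov = best / n)
}

res <- longest_orf("CCATGAAATAGGG")
```

In this toy sequence the only ORF is ATG AAA TAG in the third frame, so the sketch reports a length of 9 and a coverage of 9/13.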

2. Features based on EIIP (electron-ion interaction pseudopotential) value:

Signal at 1/3 position (Signal.Peak);

Signal to noise ratio (SNR);

Minimum value of the top 10% of the power spectrum (Signal.Min);

Quartiles Q1 and Q2 of the top 10% of the power spectrum (Signal.Q1 and Signal.Q2);

Maximum value of the top 10% of the power spectrum (Signal.Max).
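The EIIP features above can be pictured with a short base R sketch: map each nucleotide to its EIIP value, take the power spectrum with fft, and summarise it. This is an illustration under assumptions, not LncFinder's implementation: the function name eiip_spectrum_features is hypothetical, SNR is taken here as the 1/3-position signal divided by the mean power, and the top 10% is taken over the first half of the spectrum without the zero-frequency term.

```r
# Commonly tabulated EIIP values for the four DNA nucleotides.
eiip <- c(A = 0.1260, C = 0.1340, G = 0.0806, T = 0.1335)

# Hypothetical helper (not LncFinder's implementation).
eiip_spectrum_features <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  x <- unname(eiip[bases])
  stopifnot(!anyNA(x), length(x) >= 3)     # only A, C, G, T are handled
  n <- length(x)
  power <- Mod(fft(x))^2
  # Drop the zero-frequency term and keep the first half of the spectrum,
  # so half[i] corresponds to frequency i / n.
  half <- power[2:(floor(n / 2) + 1)]
  peak <- half[round(n / 3)]               # signal at the 1/3 position
  # Assumption: "top 10%" means the largest 10% of values in `half`.
  top <- sort(half, decreasing = TRUE)[seq_len(max(1, ceiling(0.1 * length(half))))]
  q <- quantile(top, c(0.25, 0.5), names = FALSE)
  c(Signal.Peak = peak, SNR = peak / mean(half),
    Signal.Min = min(top), Signal.Q1 = q[1], Signal.Q2 = q[2],
    Signal.Max = max(top))
}

# A perfectly period-3 toy sequence puts all signal at the 1/3 position.
res <- eiip_spectrum_features(paste(rep("ATG", 30), collapse = ""))
```

For the period-3 toy sequence, the peak at the 1/3 position is also the maximum of the spectrum, which is the kind of periodicity that tends to be stronger in coding sequences.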

3. Features based on secondary structure sequence:

Log-Distance.acguD.lncRNA (Dot_lnc.dist);

Log-Distance.acguD.protein-coding transcripts (Dot_pct.dist);

Distance-Ratio.acguD (Dot_Dist.Ratio);

Log-Distance.acgu-ACGU.lncRNA (SS.lnc.dist);

Log-Distance.acgu-ACGU.protein-coding transcripts (SS.pct.dist);

Distance-Ratio.acgu-ACGU (SS.Dist.Ratio);

Minimum free energy (MFE);

Percentage of Unpair-Pair (UP.PCT).
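The three feature groups above can be collected into character vectors, for example to cross-check the counts stated under Value or to select columns from the returned data.frame. This sketch assumes the column names match the names printed in this page (with the EIIP quartile column spelled Signal.Q1); the variable names are hypothetical.

```r
# Feature names as listed in this page, grouped by type.
seq_features <- c("ORF.Max.Len", "ORF.Max.Cov", "Seq.lnc.Dist",
                  "Seq.pct.Dist", "Seq.Dist.Ratio")
eiip_features <- c("Signal.Peak", "SNR", "Signal.Min",
                   "Signal.Q1", "Signal.Q2", "Signal.Max")
ss_features <- c("Dot_lnc.dist", "Dot_pct.dist", "Dot_Dist.Ratio",
                 "SS.lnc.dist", "SS.pct.dist", "SS.Dist.Ratio",
                 "MFE", "UP.PCT")

# Features expected for SS.features = FALSE and SS.features = TRUE.
features_without_ss <- c(seq_features, eiip_features)
features_with_ss <- c(features_without_ss, ss_features)

length(features_without_ss)
length(features_with_ss)
```

The two lengths are 11 and 19, matching the feature counts given under Value.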

References

Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang & Ying Li. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information, and physicochemical property. Briefings in Bioinformatics, 2019, 20(6):2009-2027.

Author(s)

HAN Siyu

Examples

## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Extract features with pre-built frequencies.file:
my_features <- extract_features(Seqs, label = "Class.of.the.Sequences",
                                SS.features = FALSE, format = "DNA",
                                frequencies.file = "mouse",
                                parallel.cores = 2)

### Use your own frequencies file by assigning a frequencies list to parameter
### "frequencies.file".

## End(Not run)
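As a hedged sketch of the last step in the example, a frequencies list returned by make_frequencies could be passed to frequencies.file as shown below. The make_frequencies argument names used here are assumptions and should be checked against ?make_frequencies; using the demo sequences as both coding and non-coding input is for illustration only.

```r
# Sketch only: runs when LncFinder is installed; otherwise it is skipped.
if (requireNamespace("LncFinder", quietly = TRUE)) {
  library(LncFinder)
  data(demo_DNA.seq)
  Seqs <- demo_DNA.seq

  # Assumption: make_frequencies() accepts these arguments (see ?make_frequencies).
  my_freq <- make_frequencies(cds.seq = Seqs, lncRNA.seq = Seqs,
                              SS.features = FALSE, cds.format = "DNA",
                              lnc.format = "DNA", check.cds = TRUE,
                              ignore.illegal = FALSE)

  # Pass the frequencies list instead of a species name.
  my_features <- extract_features(Seqs, label = "Class.of.the.Sequences",
                                  SS.features = FALSE, format = "DNA",
                                  frequencies.file = my_freq,
                                  parallel.cores = 2)
}
```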

[Package LncFinder version 1.1.5 Index]