R: Training dataset of the PredCRG model.

PredCRG_data {PredCRG}

R Documentation

Training dataset of the PredCRG model.

Description

The dataset used to train the PredCRG model comprises four sub-datasets (Q1, Q2, Q3 and Q4), prepared on the basis of homogeneity of sequence length. The positive sets of the sub-datasets are denoted pos_Q1, pos_Q2, pos_Q3 and pos_Q4, and the negative sets neg_Q1, neg_Q2, neg_Q3 and neg_Q4. Within each sub-dataset, the positive and negative sets contain the same number of sequences: 1588 for Q1, 1596 for Q2, 1593 for Q3 and 1365 for Q4. The sequence lengths of pos_Q1, pos_Q2, pos_Q3 and pos_Q4 range over 39-221, 221-363, 363-538 and 538-1000 amino acids respectively, and those of neg_Q1, neg_Q2, neg_Q3 and neg_Q4 over 43-407, 407-485, 485-607 and 607-1000 amino acids respectively. Because of space constraints on CRAN, only the Q1 sub-dataset is included in this package. All four sub-datasets are available from the GitHub repository (https://github.com/meher861982/PredCRG_dataset).

Usage

data("PredCRG_data")

Format

The datasets are in AAStringSet format, which can be obtained by reading a FASTA file with the readAAStringSet function available in the Biostrings package.
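
As a minimal sketch of how a FASTA file becomes an AAStringSet, the snippet below writes a small temporary FASTA file and reads it back with readAAStringSet. The two toy records, their names and the temporary file are made up for illustration and are not part of the PredCRG data. The Biostrings call is guarded so that the snippet still runs when that package is not installed.

```r
# Write a tiny FASTA file to a temporary location.
# The records below are illustrative toy sequences, not PredCRG data.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(
  c(">toy_seq_1", "MKTAYIAKQR",
    ">toy_seq_2", "MSDNELLKQW"),
  fasta_path
)

# Read the file back as an AAStringSet, if Biostrings is installed.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  toy_set <- Biostrings::readAAStringSet(fasta_path)
  print(toy_set)         # an AAStringSet holding the two toy sequences
  print(names(toy_set))  # record names taken from the FASTA header lines
}
```

The same call should apply to a FASTA file of any other sub-dataset once it has been downloaded locally; only the file path changes.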

Details

The protein sequences encoded by the circadian genes constitute the positive datasets, whereas sequences randomly selected from UniProt for the clade Viridiplantae constitute the negative datasets.

Source

The circadian gene sequences were collected from the circadian gene database accessible at http://cgdb.biocuckoo.org/ .

Examples


data(PredCRG_data)

pos_Q1 <- PredCRG_data$pos_Q1 #positive set of Q1 dataset
neg_Q1 <- PredCRG_data$neg_Q1 #negative set of Q1 dataset
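
The sequence counts and length ranges given in the Description can be checked against the loaded Q1 data. The following is a sketch only: summarise_lengths is a hypothetical helper defined here, not a PredCRG function, and the data-dependent part runs only when both PredCRG and Biostrings are installed.

```r
# Hypothetical helper (not part of PredCRG): number of sequences plus
# the shortest and longest length in a vector of sequence lengths.
summarise_lengths <- function(lengths) {
  c(n = length(lengths), min = min(lengths), max = max(lengths))
}

if (requireNamespace("PredCRG", quietly = TRUE) &&
    requireNamespace("Biostrings", quietly = TRUE)) {
  library(Biostrings)  # makes width() available for AAStringSet objects
  data("PredCRG_data", package = "PredCRG")

  # Compare these summaries with the counts and ranges in the Description.
  print(summarise_lengths(width(PredCRG_data$pos_Q1)))
  print(summarise_lengths(width(PredCRG_data$neg_Q1)))
}
```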

[Package PredCRG version 1.0.2 Index]