PredCRG_data {PredCRG} | R Documentation |
Training dataset of the PredCRG model.
Description
The dataset used to train the PredCRG model comprises four sub-datasets (Q1, Q2, Q3 and Q4), grouped so that the sequences within each sub-dataset are of similar length. The positive sets are named pos_Q1, pos_Q2, pos_Q3 and pos_Q4, and the negative sets neg_Q1, neg_Q2, neg_Q3 and neg_Q4. Each sub-dataset is balanced: the positive and negative sets each contain 1588, 1596, 1593 and 1365 sequences for Q1, Q2, Q3 and Q4 respectively. Sequence lengths in pos_Q1, pos_Q2, pos_Q3 and pos_Q4 range over 39-221, 221-363, 363-538 and 538-1000 amino acids respectively, and those in neg_Q1, neg_Q2, neg_Q3 and neg_Q4 over 43-407, 407-485, 485-607 and 607-1000 amino acids respectively. Because of space constraints on CRAN, only the Q1 sub-dataset is included in this package; all four sub-datasets are available from the GitHub repository (https://github.com/meher861982/PredCRG_dataset).
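For a quick overview, the counts and length ranges stated above can be tabulated in base R. This is only an illustrative sketch: the data frame and its column names (subsets, n_per_class, pos_min_len and so on) are hypothetical and are not objects provided by the package.

```r
# Figures transcribed from the description above; all names here are illustrative.
subsets <- data.frame(
  subset      = c("Q1", "Q2", "Q3", "Q4"),
  n_per_class = c(1588, 1596, 1593, 1365),  # sequences in each of pos_* and neg_*
  pos_min_len = c(39, 221, 363, 538),
  pos_max_len = c(221, 363, 538, 1000),
  neg_min_len = c(43, 407, 485, 607),
  neg_max_len = c(407, 485, 607, 1000)
)

# Each sub-dataset is balanced, so its total is twice the per-class count.
subsets$n_total <- 2 * subsets$n_per_class
total_sequences <- sum(subsets$n_total)

print(subsets)
print(total_sequences)
```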
Usage
data("PredCRG_data")
Format
The datasets are in AAStringSet format, which can be obtained by reading a FASTA file with the readAAStringSet function available in the Biostrings package.
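As a minimal sketch of how a FASTA file becomes an AAStringSet, the snippet below writes a tiny temporary FASTA file and reads it back. It assumes the Biostrings package (from Bioconductor) is installed; the two sequences and the names fasta_path and toy_set are made up for illustration and are not part of the PredCRG data.

```r
library(Biostrings)  # Bioconductor package, assumed to be installed

# Write a small two-record FASTA file (made-up sequences, illustration only).
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "MKTAYIAKQR", ">seq2", "MSDNELLK"), fasta_path)

# Read the file into an AAStringSet object.
toy_set <- readAAStringSet(fasta_path)

length(toy_set)  # number of sequences read
width(toy_set)   # length of each sequence in amino acids
names(toy_set)   # names taken from the FASTA header lines
```

The same call, pointed at a FASTA file downloaded from the GitHub repository mentioned in the Description, should yield an object of the same class as the sets shipped in this dataset.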
Details
The protein sequences encoded by the circadian genes constitute the positive dataset, whereas sequences randomly selected from UniProt for the clade Viridiplantae constitute the negative dataset.
Source
The circadian gene sequences were collected from the circadian gene database, accessible at http://cgdb.biocuckoo.org/.
See Also
PredCRG, PredCRG_Enc, PredCRG_training, model1, model2, model3, model4
Examples
data(PredCRG_data)
pos_Q1 <- PredCRG_data$pos_Q1  # positive set of the Q1 sub-dataset
neg_Q1 <- PredCRG_data$neg_Q1  # negative set of the Q1 sub-dataset