DNA {mlbench}    R Documentation
Primate splice-junction gene sequences (DNA)
Description
The dataset consists of 3,186 data points (splice junctions). Each data point is described by 180 binary indicator variables, and the problem is to recognize the 3 classes (ei, ie, neither), i.e., the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out).
The StatLog dna dataset is a processed version of the Irvine database described below. The main difference is that the symbolic variables representing the nucleotides (only A,G,T,C) were each replaced by 3 binary indicator variables. Thus the original 60 symbolic attributes were changed into 180 binary attributes. The names of the examples were removed. The examples with ambiguities were removed (there were very few of them, 4). The StatLog version of this dataset was produced by Ross King at Strathclyde University. For original details see the Irvine database documentation.
The nucleotides A,C,G,T were given indicator values as follows:
A -> 1 0 0
C -> 0 1 0
G -> 0 0 1
T -> 0 0 0
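The mapping above can be sketched in R as a small lookup table. This is an illustrative sketch only: the helper names `codes`, `encode_seq`, and `decode_seq` are hypothetical and are not part of mlbench, and the sketch assumes the three indicators for each position are stored consecutively.

```r
# Indicator codes from the table above; each row is one nucleotide.
# (Hypothetical helper objects, not part of mlbench.)
codes <- rbind(A = c(1, 0, 0),
               C = c(0, 1, 0),
               G = c(0, 0, 1),
               T = c(0, 0, 0))

# Encode a string of nucleotides into a flat 0/1 vector,
# three indicators per sequence position.
encode_seq <- function(s) {
  nt <- strsplit(s, "")[[1]]
  stopifnot(all(nt %in% rownames(codes)))
  as.vector(t(codes[nt, , drop = FALSE]))
}

# Decode a flat 0/1 vector back into a nucleotide string.
decode_seq <- function(bits) {
  stopifnot(length(bits) %% 3 == 0)
  m <- matrix(bits, ncol = 3, byrow = TRUE)
  keys <- apply(m, 1, paste, collapse = "")
  lookup <- setNames(rownames(codes),
                     apply(codes, 1, paste, collapse = ""))
  paste(unname(lookup[keys]), collapse = "")
}

encode_seq("GAT")
decode_seq(encode_seq("GAT"))
```

Under this scheme a sequence of 60 nucleotides becomes 180 indicator values, which matches the attribute count stated above.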
Hint. Much better performance is generally observed if attributes closest to the junction are used. In the StatLog version, this means using attributes A61 to A120 only.
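A minimal sketch of the hint, selecting the 61st to 120th attributes by position. The helper `junction_subset` is hypothetical, and the sketch assumes that the 180 indicator columns come first in the data frame, in attribute order, and that the target column is named `Class`; check `names(DNA)` before relying on either assumption.

```r
# Keep only the attributes closest to the junction plus the target.
# Assumption: predictor columns are ordered by attribute number and
# the target column is called "Class".
junction_subset <- function(df, cols = 61:120, target = "Class") {
  stopifnot(target %in% names(df), max(cols) <= ncol(df))
  df[, c(names(df)[cols], target), drop = FALSE]
}

# Apply to the real data only if mlbench is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("DNA", package = "mlbench")
  DNA_junction <- junction_subset(DNA)
  print(dim(DNA_junction))
}
```

Selecting by position rather than by name avoids depending on how the attribute columns happen to be labelled in the R version of the data.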
Usage
data("DNA", package = "mlbench")
Format
A data frame with 3,186 observations on 180 variables, all nominal, and a target class.
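The layout described here can be checked with a short sketch. The helper `describe_layout` is hypothetical, and the target column name `Class` is an assumption about the R version of the data.

```r
# Summarise a data frame laid out as predictors plus one target factor.
# Assumption: the target column is named "Class".
describe_layout <- function(df, target = "Class") {
  stopifnot(target %in% names(df))
  predictors <- df[, setdiff(names(df), target), drop = FALSE]
  list(
    n_obs        = nrow(df),
    n_predictors = ncol(predictors),
    all_factors  = all(vapply(predictors, is.factor, logical(1))),
    class_counts = table(df[[target]])
  )
}

# Apply to the real data only if mlbench is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("DNA", package = "mlbench")
  print(describe_layout(DNA))
}
```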
Source
Source:
- all examples taken from Genbank 64.1 (ftp site: genbank.bio.net)
- categories "ei" and "ie" include every "split-gene" for primates in Genbank 64.1
- non-splice examples taken from sequences known not to include a splicing site
Donor: G. Towell, M. Noordewier, and J. Shavlik, towell,shavlik@cs.wisc.edu, noordewi@cs.rutgers.edu
These data have been taken from:
ftp.stams.strath.ac.uk/pub/Statlog
and were converted to R format by Evgenia Dimitriadou.
References
machine learning:
- M. O. Noordewier, G. G. Towell, and J. W. Shavlik, 1991; "Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences", Advances in Neural Information Processing Systems, volume 3, Morgan Kaufmann.
- G. G. Towell, J. W. Shavlik, and M. W. Craven, 1991; "Constructive Induction in Knowledge-Based Neural Networks", Proceedings of the Eighth International Machine Learning Workshop, Morgan Kaufmann.
- G. G. Towell, 1991; "Symbolic Knowledge and Neural Networks: Insertion, Refinement, and Extraction", PhD Thesis, University of Wisconsin - Madison.
- G. G. Towell and J. W. Shavlik, 1992; "Interpretation of Artificial Neural Networks: Mapping Knowledge-based Neural Networks into Rules", Advances in Neural Information Processing Systems, volume 4, Morgan Kaufmann.
Examples
data("DNA", package = "mlbench")
summary(DNA)