| .names file created by George John, October 1994
| Processing:
|  * A,C,T,G -> 001,010,100,000  Seems biased against systems that can handle
|    categorical attributes
|
|1. TITLE:
|   DNA Dataset (STATLOG version) - Primate splice-junction gene sequences (DNA)
|   with associated imperfect domain theory
|
|   PROBLEM DESCRIPTION
|   Splice junctions are points on a DNA sequence at which `superfluous' DNA is
|   removed during the process of protein creation in higher organisms. The
|   problem posed in this dataset is to recognize, given a sequence of DNA, the
|   boundaries between exons (the parts of the DNA sequence retained after
|   splicing) and introns (the parts of the DNA sequence that are spliced
|   out).
|
|   PURPOSE
|   This problem consists of two subtasks: recognizing exon/intron
|   boundaries (referred to as EI sites), and recognizing intron/exon boundaries
|   (IE sites). (In the biological community, IE borders are referred to
|   as ``acceptors'' while EI borders are referred to as ``donors''.)
|
|2. USE IN STATLOG
|
|   2.1- Testing Mode
|        Train & Test
|
|   2.2- Special Preprocessing
|        Yes
|
|   2.3- Test Results
|
|        Algorithm    Success Rate
|        =========    ============
|        Radial          95.90
|        Dipol92         95.200
|        Alloc80         94.300
|        QuaDisc         94.100
|        Discrim         94.100
|        LogDisc         93.900
|        Bayes           93.200
|        Castle          92.800
|        IndCart         92.700
|        C4.5            92.400
|        Cart            91.500
|        BackProp        91.200
|        BayTree         90.500
|        Cn2             90.500
|        Ac2             90.000
|        NewId           90.000
|        Cal5            86.900
|        Itrule          86.500
|        Smart           85.900
|        KNN             84.500
|        Kohonen         66.10
|        Default         52.000
|        LVQ              0.000
|        Cascade          0.000
|
|3. SOURCES and PAST USAGE
|   3.1 SOURCES
|     (a) Creators:
|         - all examples taken from Genbank 64.1 (ftp site: genbank.bio.net)
|         - categories "ei" and "ie" include every "split-gene"
|           for primates in Genbank 64.1
|         - non-splice examples taken from sequences known not to include
|           a splicing site
|     (b) Donor: G. Towell, M. Noordewier, and J.
|         Shavlik, {towell,shavlik}@cs.wisc.edu, noordewi@cs.rutgers.edu
|     (c) Date received: 1/1/92
|
|   The StatLog DNA dataset is a processed version of the Irvine
|   database described below. The main difference is that the
|   symbolic variables representing the nucleotides (only A,G,T,C)
|   were replaced by 3 binary indicator variables. Thus the original
|   60 symbolic attributes were changed into 180 binary attributes.
|   The names of the examples were removed. The examples with
|   ambiguities were removed (there were very few of them, 4).
|   The StatLog version of this dataset was produced by Ross King
|   at Strathclyde University. For original details see the Irvine
|   database documentation.
|
|   The nucleotides A,C,G,T were given indicator values as follows:
|
|        A -> 1 0 0
|        C -> 0 1 0
|        G -> 0 0 1
|        T -> 0 0 0
|
|   The class values are:
|        ei -> 1
|        ie -> 2
|        n  -> 3
|
|   3.2 PAST USAGE
|
|     (a) machine learning:
|         -- M. O. Noordewier and G. G. Towell and J. W. Shavlik, 1991;
|            "Training Knowledge-Based Neural Networks to Recognize Genes in
|            DNA Sequences". Advances in Neural Information Processing Systems,
|            volume 3, Morgan Kaufmann.
|
|         -- G. G. Towell and J. W. Shavlik and M. W. Craven, 1991;
|            "Constructive Induction in Knowledge-Based Neural Networks",
|            In Proceedings of the Eighth International Machine Learning
|            Workshop, Morgan Kaufmann.
|
|         -- G. G. Towell, 1991;
|            "Symbolic Knowledge and Neural Networks: Insertion, Refinement, and
|            Extraction", PhD Thesis, University of Wisconsin - Madison.
|
|         -- G. G. Towell and J. W. Shavlik, 1992;
|            "Interpretation of Artificial Neural Networks: Mapping
|            Knowledge-based Neural Networks into Rules", In Advances in Neural
|            Information Processing Systems, volume 4, Morgan Kaufmann.
|
|     (b) attributes predicted: given a position in the middle of a window
|         of 60 DNA sequence elements (called "nucleotides" or "base-pairs"),
|         decide if this is a
|           a) "intron -> exon" boundary (ie) [These are sometimes called "acceptors"]
|           b) "exon -> intron" boundary (ei) [These are sometimes called "donors"]
|           c) neither (n)
|     (c) Results of study indicated that machine learning techniques (neural
|         networks, nearest neighbor, contributors' KBANN system) performed as
|         well as or better than classification based on canonical pattern
|         matching (method used in biological literature).
|
|   HISTORY
|   This dataset has been developed to help evaluate a "hybrid" learning
|   algorithm (KBANN) that uses examples to inductively refine preexisting
|   knowledge. Using a "ten-fold cross-validation" methodology on 1000
|   examples randomly selected from the complete set of 3190, the following
|   error rates were produced by various ML algorithms (all experiments
|   run at the Univ of Wisconsin, sometimes with local implementations
|   of published algorithms).
|
|        System            Neither     EI       IE
|        --------------    -------    -----    -----
|        KBANN               4.62      7.56     8.47
|        BACKPROP            5.29      5.74    10.75
|        PEBLS               6.86      8.18     7.55
|        PERCEPTRON          3.99     16.32    17.41
|        ID3                 8.84     10.58    13.99
|        COBWEB             11.80     15.04     9.46
|        Near. Neighbor     31.11     11.65     9.09
|
|   Type of domain: non-numeric, nominal (one of A, G, T, C)
|
|*************************************************************
|
|4. DATASET DESCRIPTION
|   NUMBER OF EXAMPLES:
|        3186
|
|        Train    2000
|        Test     1186
|
|   NUMBER OF CLASSES:
|        3 (one of 1,2,3)
|
|   Distribution of classes
|        Class    Train             Test
|        ------------------------------------------
|        1         464 (23.20%)     303 (25.55%)
|        2         485 (24.25%)     280 (23.61%)
|        3        1051 (52.55%)     603 (50.84%)
|
|   NUMBER OF ATTRIBUTES:
|        180 binary indicator variables
|
|   Hint. Much better performance is generally observed if only the
|   attributes closest to the junction are used.
|   In the StatLog version, this means using
|   attributes A61 to A120 only.
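|
|   The indicator coding and the hint above can be illustrated with a short
|   Python sketch. It is not part of the StatLog distribution. The helper
|   names (encode, decode, junction_attributes) are hypothetical, the triples
|   are the ones listed in section 3.1, and the sketch assumes that the three
|   indicators of each nucleotide are stored next to each other in sequence
|   order and that the hint numbers attributes from 1. Under those
|   assumptions the same columns would be A60 to A119 in a zero-based naming.

```python
# Indicator triples as listed in section 3.1 of this file.
BASE_TO_TRIPLE = {"A": (1, 0, 0), "C": (0, 1, 0), "G": (0, 0, 1), "T": (0, 0, 0)}
TRIPLE_TO_BASE = {triple: base for base, triple in BASE_TO_TRIPLE.items()}

# Class codes as listed in section 3.1.
CLASS_NAMES = {1: "ei", 2: "ie", 3: "n"}


def encode(sequence):
    """Expand a nucleotide string into a flat list of 0/1 indicators."""
    return [bit for base in sequence for bit in BASE_TO_TRIPLE[base]]


def decode(bits):
    """Collapse a flat list of 0/1 indicators back into a nucleotide string.

    Assumption: the three indicators of a nucleotide are adjacent.
    """
    if len(bits) % 3 != 0:
        raise ValueError("indicator count must be a multiple of 3")
    return "".join(
        TRIPLE_TO_BASE[tuple(bits[i:i + 3])] for i in range(0, len(bits), 3)
    )


def junction_attributes(bits):
    """Keep the 60 indicators the hint calls A61 to A120 (counting from 1)."""
    return bits[60:120]
```

|   For a 60-nucleotide window, encode() yields 180 values, and
|   decode(junction_attributes(bits)) recovers the 20 nucleotides in the
|   middle of the window.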
|
|CONTACTS
|   statlog-adm@ncc.up.pt
|   bob@stams.strathclyde.ac.uk
|
|================================================================================
|;little lisp function to generate the names A0 .. A179 listed below:
|(defun atts ()
|  (let ((i 0))
|    (while (< i 180)
|      (insert (format "A%s: continuous.\n" i))
|      (setq i (+ 1 i)))))

1,2,3.    | classes

A0: continuous.
A1: continuous.
A2: continuous.
A3: continuous.
A4: continuous.
A5: continuous.
A6: continuous.
A7: continuous.
A8: continuous.
A9: continuous.
A10: continuous.
A11: continuous.
A12: continuous.
A13: continuous.
A14: continuous.
A15: continuous.
A16: continuous.
A17: continuous.
A18: continuous.
A19: continuous.
A20: continuous.
A21: continuous.
A22: continuous.
A23: continuous.
A24: continuous.
A25: continuous.
A26: continuous.
A27: continuous.
A28: continuous.
A29: continuous.
A30: continuous.
A31: continuous.
A32: continuous.
A33: continuous.
A34: continuous.
A35: continuous.
A36: continuous.
A37: continuous.
A38: continuous.
A39: continuous.
A40: continuous.
A41: continuous.
A42: continuous.
A43: continuous.
A44: continuous.
A45: continuous.
A46: continuous.
A47: continuous.
A48: continuous.
A49: continuous.
A50: continuous.
A51: continuous.
A52: continuous.
A53: continuous.
A54: continuous.
A55: continuous.
A56: continuous.
A57: continuous.
A58: continuous.
A59: continuous.
A60: continuous.
A61: continuous.
A62: continuous.
A63: continuous.
A64: continuous.
A65: continuous.
A66: continuous.
A67: continuous.
A68: continuous.
A69: continuous.
A70: continuous.
A71: continuous.
A72: continuous.
A73: continuous.
A74: continuous.
A75: continuous.
A76: continuous.
A77: continuous.
A78: continuous.
A79: continuous.
A80: continuous.
A81: continuous.
A82: continuous.
A83: continuous.
A84: continuous.
A85: continuous.
A86: continuous.
A87: continuous.
A88: continuous.
A89: continuous.
A90: continuous.
A91: continuous.
A92: continuous.
A93: continuous.
A94: continuous.
A95: continuous.
A96: continuous.
A97: continuous.
A98: continuous.
A99: continuous.
A100: continuous.
A101: continuous.
A102: continuous.
A103: continuous.
A104: continuous.
A105: continuous.
A106: continuous.
A107: continuous.
A108: continuous.
A109: continuous.
A110: continuous.
A111: continuous.
A112: continuous.
A113: continuous.
A114: continuous.
A115: continuous.
A116: continuous.
A117: continuous.
A118: continuous.
A119: continuous.
A120: continuous.
A121: continuous.
A122: continuous.
A123: continuous.
A124: continuous.
A125: continuous.
A126: continuous.
A127: continuous.
A128: continuous.
A129: continuous.
A130: continuous.
A131: continuous.
A132: continuous.
A133: continuous.
A134: continuous.
A135: continuous.
A136: continuous.
A137: continuous.
A138: continuous.
A139: continuous.
A140: continuous.
A141: continuous.
A142: continuous.
A143: continuous.
A144: continuous.
A145: continuous.
A146: continuous.
A147: continuous.
A148: continuous.
A149: continuous.
A150: continuous.
A151: continuous.
A152: continuous.
A153: continuous.
A154: continuous.
A155: continuous.
A156: continuous.
A157: continuous.
A158: continuous.
A159: continuous.
A160: continuous.
A161: continuous.
A162: continuous.
A163: continuous.
A164: continuous.
A165: continuous.
A166: continuous.
A167: continuous.
A168: continuous.
A169: continuous.
A170: continuous.
A171: continuous.
A172: continuous.
A173: continuous.
A174: continuous.
A175: continuous.
A176: continuous.
A177: continuous.
A178: continuous.
A179: continuous.
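
| The attribute declarations in this file follow one fixed pattern, so they
| can be regenerated or checked mechanically. The Python sketch below is one
| possible way to do that and is not part of the original distribution. The
| function name attribute_declarations and its parameters are hypothetical.
| The default start of 0 matches the names A0 .. A179 used in this file.

```python
def attribute_declarations(count=180, start=0):
    """Return C4.5-style declarations such as 'A0: continuous.'.

    count is the number of attributes; start is the index of the first one.
    """
    return [f"A{i}: continuous." for i in range(start, start + count)]
```

| Joining the returned list with newlines gives the block of declarations
| above. Passing start=1 would instead give A1 .. A180.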