| .names file created by George John, October 1994
| Processing:
|  * A,C,G,T -> 100,010,001,000 (see section 3.1). Seems biased against
|    systems that can handle categorical attributes
|
|1. TITLE:
|   DNA Dataset (STATLOG version) - Primate splice-junction gene sequences (DNA)
|   with associated imperfect domain theory
|
|   PROBLEM DESCRIPTION
|   Splice junctions are points on a DNA sequence at which `superfluous' DNA is
|   removed during the process of protein creation in higher organisms. The
|   problem posed in this dataset is to recognize, given a sequence of DNA, the
|   boundaries between exons (the parts of the DNA sequence retained after
|   splicing) and introns (the parts of the DNA sequence that are spliced
|   out).
|
|   PURPOSE
|   This problem consists of two subtasks: recognizing exon/intron
|   boundaries (referred to as EI sites), and recognizing intron/exon boundaries
|   (IE sites). (In the biological community, IE borders are referred to
|   as ``acceptors'' while EI borders are referred to as ``donors''.)
|
|2. USE IN STATLOG
|
|   2.1- Testing Mode
|        Train & Test
|
|   2.2- Special Preprocessing
|        Yes
|
|   2.3- Test Results
|
|        Algorithm    Success Rate (%)
|        =========    ================
|        Radial       95.90
|        Dipol92      95.200
|        Alloc80      94.300
|        QuaDisc      94.100
|        Discrim      94.100
|        LogDisc      93.900
|        Bayes        93.200
|        Castle       92.800
|        IndCart      92.700
|        C4.5         92.400
|        Cart         91.500
|        BackProp     91.200
|        BayTree      90.500
|        Cn2          90.500
|        Ac2          90.000
|        NewId        90.000
|        Cal5         86.900
|        Itrule       86.500
|        Smart        85.900
|        KNN          84.500
|        Kohonen      66.10
|        Default      52.000
|        LVQ           0.000
|        Cascade       0.000
|
|3. SOURCES and PAST USAGE
|   3.1 SOURCES
|   (a) Creators:
|       - all examples taken from Genbank 64.1 (ftp site: genbank.bio.net)
|       - categories "ei" and "ie" include every "split-gene"
|         for primates in Genbank 64.1
|       - non-splice examples taken from sequences known not to include
|         a splicing site
|   (b) Donor: G. Towell, M. Noordewier, and J.
|       Shavlik,
|       {towell,shavlik}@cs.wisc.edu, noordewi@cs.rutgers.edu
|   (c) Date received: 1/1/92
|
|   The StatLog dna dataset is a processed version of the Irvine
|   database described below. The main difference is that the
|   symbolic variables representing the nucleotides (only A,G,T,C)
|   were each replaced by 3 binary indicator variables. Thus the original
|   60 symbolic attributes were changed into 180 binary attributes.
|   The names of the examples were removed. The examples with
|   ambiguities were removed (there were only 4 of them).
|   The StatLog version of this dataset was produced by Ross King
|   at Strathclyde University. For original details see the Irvine
|   database documentation.
|
|   The nucleotides A,C,G,T were given indicator values as follows
|
|       A -> 1 0 0
|       C -> 0 1 0
|       G -> 0 0 1
|       T -> 0 0 0
|
|   The class values are
|       ei -> 1
|       ie -> 2
|       n  -> 3
|
|   3.2 PAST USAGE
|
|   (a) machine learning:
|       -- M. O. Noordewier and G. G. Towell and J. W. Shavlik, 1991;
|          "Training Knowledge-Based Neural Networks to Recognize Genes in
|          DNA Sequences". Advances in Neural Information Processing Systems,
|          volume 3, Morgan Kaufmann.
|
|       -- G. G. Towell and J. W. Shavlik and M. W. Craven, 1991;
|          "Constructive Induction in Knowledge-Based Neural Networks".
|          In Proceedings of the Eighth International Machine Learning
|          Workshop, Morgan Kaufmann.
|
|       -- G. G. Towell, 1991;
|          "Symbolic Knowledge and Neural Networks: Insertion, Refinement, and
|          Extraction". PhD Thesis, University of Wisconsin - Madison.
|
|       -- G. G. Towell and J. W. Shavlik, 1992;
|          "Interpretation of Artificial Neural Networks: Mapping
|          Knowledge-based Neural Networks into Rules". In Advances in Neural
|          Information Processing Systems, volume 4, Morgan Kaufmann.
|
|   (b) attributes predicted: given a position in the middle of a window
|       of 60 DNA sequence elements (called "nucleotides" or "base-pairs"),
|       decide if this is a
|         a) "intron -> exon" boundary (ie) [These are sometimes called "acceptors"]
|         b) "exon -> intron" boundary (ei) [These are sometimes called "donors"]
|         c) neither (n)
|   (c) Results of study indicated that machine learning techniques (neural
|       networks, nearest neighbor, contributors' KBANN system) performed as
|       well as or better than classification based on canonical pattern
|       matching (method used in biological literature).
|
|   HISTORY
|   This dataset has been developed to help evaluate a "hybrid" learning
|   algorithm (KBANN) that uses examples to inductively refine preexisting
|   knowledge. Using a "ten-fold cross-validation" methodology on 1000
|   examples randomly selected from the complete set of 3190, the following
|   error rates were produced by various ML algorithms (all experiments
|   run at the Univ of Wisconsin, sometimes with local implementations
|   of published algorithms).
|
|       System            Neither     EI      IE
|       --------------    -------   -----   -----
|       KBANN                4.62    7.56    8.47
|       BACKPROP             5.29    5.74   10.75
|       PEBLS                6.86    8.18    7.55
|       PERCEPTRON           3.99   16.32   17.41
|       ID3                  8.84   10.58   13.99
|       COBWEB              11.80   15.04    9.46
|       Near. Neighbor      31.11   11.65    9.09
|
|   Type of domain: non-numeric, nominal (one of A, G, T, C)
|
|*************************************************************
|
|4. DATASET DESCRIPTION
|   NUMBER OF EXAMPLES:
|       3186
|
|       Train   2000
|       Test    1186
|
|   NUMBER OF CLASSES:
|       3 (one of 1,2,3)
|
|   Distribution of classes
|       Class    Train             Test
|       ------------------------------------------
|       1         464 (23.20%)     303 (25.55%)
|       2         485 (24.25%)     280 (23.61%)
|       3        1051 (52.55%)     603 (50.84%)
|
|   NUMBER OF ATTRIBUTES:
|       180 binary indicator variables
|
|   Hint. Much better performance is generally observed if only the
|   attributes closest to the junction are used.
|   In the StatLog version, this means using
|   attributes A61 to A120 only.
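|
|   A minimal sketch in Python (not part of the StatLog distribution) of how
|   a row of 180 indicator values could be mapped back to nucleotides using
|   the table in section 3.1, and how the attributes named in the hint could
|   be selected. The function names and the example row are hypothetical.
|   The sketch assumes the hint counts attributes from 1, so that A61 to A120
|   become the 0-based slice 60:120; check this against the column order of
|   your copy of the data.

```python
# Indicator triples from section 3.1 of this file:
# A, C and G are one-hot, T is the all-zero triple.
TRIPLE_TO_BASE = {
    (1, 0, 0): "A",
    (0, 1, 0): "C",
    (0, 0, 1): "G",
    (0, 0, 0): "T",
}


def decode_row(bits):
    """Turn 180 binary indicator values into a 60-character nucleotide string."""
    if len(bits) != 180:
        raise ValueError("expected 180 indicator values, got %d" % len(bits))
    bases = []
    for start in range(0, 180, 3):
        triple = tuple(bits[start:start + 3])
        if triple not in TRIPLE_TO_BASE:
            raise ValueError("invalid triple %r at offset %d" % (triple, start))
        bases.append(TRIPLE_TO_BASE[triple])
    return "".join(bases)


def junction_window(bits):
    """Keep the 60 indicators nearest the junction (the hint's A61 to A120).

    ASSUMPTION: the hint counts attributes from 1, so this is the 0-based
    slice 60:120, i.e. nucleotides 21 to 40 of the 60-nucleotide window.
    """
    return list(bits[60:120])


# Hypothetical example row: 60 copies of the triple for "C".
example = [0, 1, 0] * 60
print(decode_row(example)[:10])
print(len(junction_window(example)))
```

|   For the example row, decode_row returns a string of 60 C's and
|   junction_window keeps 60 of the 180 values.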
|
|
|CONTACTS
|   statlog-adm@ncc.up.pt
|   bob@stams.strathclyde.ac.uk
|
|
|================================================================================
|;little lisp function to generate names:
|(defun atts ()
|  (let ((i 1))
|    (while (<= i 180)
|      (insert (format "A%s: continuous.\n" i))
|      (setq i (+ 1 i)))))

1,2,3.  | classes

A0: 0,1.
A1: 0,1.
A2: 0,1.
A3: 0,1.
A4: 0,1.
A5: 0,1.
A6: 0,1.
A7: 0,1.
A8: 0,1.
A9: 0,1.
A10: 0,1.
A11: 0,1.
A12: 0,1.
A13: 0,1.
A14: 0,1.
A15: 0,1.
A16: 0,1.
A17: 0,1.
A18: 0,1.
A19: 0,1.
A20: 0,1.
A21: 0,1.
A22: 0,1.
A23: 0,1.
A24: 0,1.
A25: 0,1.
A26: 0,1.
A27: 0,1.
A28: 0,1.
A29: 0,1.
A30: 0,1.
A31: 0,1.
A32: 0,1.
A33: 0,1.
A34: 0,1.
A35: 0,1.
A36: 0,1.
A37: 0,1.
A38: 0,1.
A39: 0,1.
A40: 0,1.
A41: 0,1.
A42: 0,1.
A43: 0,1.
A44: 0,1.
A45: 0,1.
A46: 0,1.
A47: 0,1.
A48: 0,1.
A49: 0,1.
A50: 0,1.
A51: 0,1.
A52: 0,1.
A53: 0,1.
A54: 0,1.
A55: 0,1.
A56: 0,1.
A57: 0,1.
A58: 0,1.
A59: 0,1.
A60: 0,1.
A61: 0,1.
A62: 0,1.
A63: 0,1.
A64: 0,1.
A65: 0,1.
A66: 0,1.
A67: 0,1.
A68: 0,1.
A69: 0,1.
A70: 0,1.
A71: 0,1.
A72: 0,1.
A73: 0,1.
A74: 0,1.
A75: 0,1.
A76: 0,1.
A77: 0,1.
A78: 0,1.
A79: 0,1.
A80: 0,1.
A81: 0,1.
A82: 0,1.
A83: 0,1.
A84: 0,1.
A85: 0,1.
A86: 0,1.
A87: 0,1.
A88: 0,1.
A89: 0,1.
A90: 0,1.
A91: 0,1.
A92: 0,1.
A93: 0,1.
A94: 0,1.
A95: 0,1.
A96: 0,1.
A97: 0,1.
A98: 0,1.
A99: 0,1.
A100: 0,1.
A101: 0,1.
A102: 0,1.
A103: 0,1.
A104: 0,1.
A105: 0,1.
A106: 0,1.
A107: 0,1.
A108: 0,1.
A109: 0,1.
A110: 0,1.
A111: 0,1.
A112: 0,1.
A113: 0,1.
A114: 0,1.
A115: 0,1.
A116: 0,1.
A117: 0,1.
A118: 0,1.
A119: 0,1.
A120: 0,1.
A121: 0,1.
A122: 0,1.
A123: 0,1.
A124: 0,1.
A125: 0,1.
A126: 0,1.
A127: 0,1.
A128: 0,1.
A129: 0,1.
A130: 0,1.
A131: 0,1.
A132: 0,1.
A133: 0,1.
A134: 0,1.
A135: 0,1.
A136: 0,1.
A137: 0,1.
A138: 0,1.
A139: 0,1.
A140: 0,1.
A141: 0,1.
A142: 0,1.
A143: 0,1.
A144: 0,1.
A145: 0,1.
A146: 0,1.
A147: 0,1.
A148: 0,1.
A149: 0,1.
A150: 0,1.
A151: 0,1.
A152: 0,1.
A153: 0,1.
A154: 0,1.
A155: 0,1.
A156: 0,1.
A157: 0,1.
A158: 0,1.
A159: 0,1.
A160: 0,1.
A161: 0,1.
A162: 0,1.
A163: 0,1.
A164: 0,1.
A165: 0,1.
A166: 0,1.
A167: 0,1.
A168: 0,1.
A169: 0,1.
A170: 0,1.
A171: 0,1.
A172: 0,1.
A173: 0,1.
A174: 0,1.
A175: 0,1.
A176: 0,1.
A177: 0,1.
A178: 0,1.
A179: 0,1.
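|
| The declarations above follow a regular pattern, so they could be
| regenerated by a short script. A minimal Python sketch, assuming the
| attribute names A0 to A179 and the value set 0,1 used in this file (the
| lisp helper in the header comment would instead emit A1 to A180 as
| continuous). names_lines is a hypothetical helper, not part of any tool.

```python
def names_lines(n_attributes=180, classes=(1, 2, 3)):
    """Build the class line and the binary attribute declarations.

    ASSUMPTION: attributes are named A0 .. A(n-1) and take the values 0,1,
    matching the declarations in this file.
    """
    lines = [",".join(str(c) for c in classes) + ".  | classes"]
    lines.extend("A%d: 0,1." % i for i in range(n_attributes))
    return lines


if __name__ == "__main__":
    # Print the declarations so they can be redirected into a .names file.
    print("\n".join(names_lines()))
```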