| From Eddie at Irvine... Ronnyk
| 1. Title of Database: E. coli promoter gene sequences (DNA)
|                       with associated imperfect domain theory
|
| 2. Sources:
|    (a) Creators:
|        - promoter instances: C. Harley (CHARLEY@McMaster.CA) and R. Reynolds
|        - non-promoter instances and domain theory: M. Noordewier
|          -- (non-promoters derived from work of lab of Prof. Tom Record,
|              University of Wisconsin Biochemistry Department)
|    (b) Donor: M. Noordewier and J. Shavlik, {noordewi,shavlik}@cs.wisc.edu
|    (c) Date received: 6/30/90
|
| 3. Past Usage:
|    (a) biological:
|        -- Harley, C. and Reynolds, R. 1987.
|           "Analysis of E. Coli Promoter Sequences."
|           Nucleic Acids Research, 15:2343-2361.
|        machine learning:
|        -- Towell, G., Shavlik, J. and Noordewier, M. 1990.
|           "Refinement of Approximate Domain Theories by Knowledge-Based
|           Artificial Neural Networks." In Proceedings of the Eighth National
|           Conference on Artificial Intelligence (AAAI-90).
|    (b) attributes predicted: member/non-member of class of sequences with
|        biological promoter activity (promoters initiate the process of gene
|        expression).
|    (c) Results of study indicated that machine learning techniques (neural
|        networks, nearest neighbor, contributors' KBANN system) performed as
|        well/better than classification based on canonical pattern matching
|        (method used in biological literature).
|
| 4. Relevant Information Paragraph:
|    This dataset has been developed to help evaluate a "hybrid" learning
|    algorithm ("KBANN") that uses examples to inductively refine preexisting
|    knowledge. Using a "leave-one-out" methodology, the following errors
|    were produced by various ML algorithms. (See Towell, Shavlik, &
|    Noordewier, 1990, for details.)
|
|    System       Errors   Comments
|    ------       ------   --------
|    KBANN         4/106   a hybrid ML system
|    BP            8/106   std backprop with one hidden layer
|    O'Neill      12/106   ad hoc technique from the bio. lit.
|    Near-Neigh   13/106   a nearest-neighbor algo (k=3)
|    ID3          19/106   Quinlan's decision-tree builder
|
|    Type of domain: non-numeric, nominal (one of A, G, T, C)
|    -- Note: DNA nucleotides can be grouped into a hierarchy, as shown below:
|
|                     X (any)
|                   /   \
|      (purine)   R       Y   (pyrimidine)
|                / \     / \
|               A   G   T   C
|
|
| 5. Number of Instances: 106
|
| 6. Number of Attributes: 59
|    -- class (positive or negative)
|    -- instance name
|    -- 57 sequential nucleotide ("base-pair") positions
|
| 7. Attribute information:
|    -- Statistics for numeric domains: No numeric features used.
|    -- Statistics for non-numeric domains
|       -- Frequencies:   Promoters   Non-Promoters
|                         ---------   -------------
|                    A      27.7%        24.4%
|                    G      20.0%        25.4%
|                    T      30.2%        26.5%
|                    C      22.1%        23.7%
|
|    Attribute #:   Description:
|    ============   ============
|    1              One of {+/-}, indicating the class ("+" = promoter).
|    2              The instance name (non-promoters named by position in the
|                   1500-long nucleotide sequence provided by T. Record).
|    3-59           The remaining 57 fields are the sequence, starting at
|                   position -50 (p-50) and ending at position +7 (p7). Each of
|                   these fields is filled by one of {a, g, t, c}.
|
| 8. Missing Attribute Values: none
|
| 9. Class Distribution: 50% (53 positive instances, 53 negative instances)

+,-
attr 01 : a,g,t,c
attr 02 : a,g,t,c
attr 03 : a,g,t,c
attr 04 : a,g,t,c
attr 05 : a,g,t,c
attr 06 : a,g,t,c
attr 07 : a,g,t,c
attr 08 : a,g,t,c
attr 09 : a,g,t,c
attr 10 : a,g,t,c
attr 11 : a,g,t,c
attr 12 : a,g,t,c
attr 13 : a,g,t,c
attr 14 : a,g,t,c
attr 15 : a,g,t,c
attr 16 : a,g,t,c
attr 17 : a,g,t,c
attr 18 : a,g,t,c
attr 19 : a,g,t,c
attr 20 : a,g,t,c
attr 21 : a,g,t,c
attr 22 : a,g,t,c
attr 23 : a,g,t,c
attr 24 : a,g,t,c
attr 25 : a,g,t,c
attr 26 : a,g,t,c
attr 27 : a,g,t,c
attr 28 : a,g,t,c
attr 29 : a,g,t,c
attr 30 : a,g,t,c
attr 31 : a,g,t,c
attr 32 : a,g,t,c
attr 33 : a,g,t,c
attr 34 : a,g,t,c
attr 35 : a,g,t,c
attr 36 : a,g,t,c
attr 37 : a,g,t,c
attr 38 : a,g,t,c
attr 39 : a,g,t,c
attr 40 : a,g,t,c
attr 41 : a,g,t,c
attr 42 : a,g,t,c
attr 43 : a,g,t,c
attr 44 : a,g,t,c
attr 45 : a,g,t,c
attr 46 : a,g,t,c
attr 47 : a,g,t,c
attr 48 : a,g,t,c
attr 49 : a,g,t,c
attr 50 : a,g,t,c
attr 51 : a,g,t,c
attr 52 : a,g,t,c
attr 53 : a,g,t,c
attr 54 : a,g,t,c
attr 55 : a,g,t,c
attr 56 : a,g,t,c
attr 57 : a,g,t,c
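
The attribute description in section 7 and the nucleotide hierarchy in section 4 can be illustrated with a minimal Python sketch. The function names (`position_labels`, `to_groups`, `base_frequencies`) are hypothetical and not part of the dataset. The sketch assumes the 57 fields run from p-50 to p7 with no position 0, which is the only numbering that yields exactly 57 positions; the on-disk layout of the data records is not assumed here.

```python
from collections import Counter

# Nucleotide hierarchy from the note in section 4:
# R = purine (a, g), Y = pyrimidine (t, c).
GROUP = {"a": "R", "g": "R", "t": "Y", "c": "Y"}


def position_labels(n_fields=57, start=-50):
    """Return labels p-50 .. p7 for the sequence fields.

    Assumption: there is no position 0, so -50..-1 and +1..+7
    together give 57 labels.
    """
    labels = []
    pos = start
    while len(labels) < n_fields:
        if pos != 0:
            labels.append(f"p{pos}")
        pos += 1
    return labels


def to_groups(seq):
    """Map each nucleotide to its purine/pyrimidine group letter."""
    return "".join(GROUP[base] for base in seq.lower())


def base_frequencies(seq):
    """Relative frequency of a, g, t, c in one sequence."""
    counts = Counter(seq.lower())
    return {base: counts[base] / len(seq) for base in "agtc"}
```

Applied to every promoter (or non-promoter) sequence and averaged, `base_frequencies` is the kind of computation that would produce a table like the frequency statistics in section 7; the figures in that table come from the original documentation and are not recomputed here.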