| .names file created by George John, October 1994
| Processing:
| * A,C,G,T -> 100,010,001,000 (three binary indicators per nucleotide, as
|   listed in section 3.1).  Seems biased against systems that can handle
|   categorical attributes
| 
| 
|
|1. TITLE:
|	DNA Dataset (STATLOG version) - Primate splice-junction gene sequences (DNA)
|        with associated imperfect domain theory
|
|	PROBLEM DESCRIPTION
|	Splice junctions are points on a DNA sequence at which `superfluous' DNA is
|	removed during the process of protein creation in higher organisms.  The
|	problem posed in this dataset is to recognize, given a sequence of DNA, the
|	boundaries between exons (the parts of the DNA sequence retained after
|	splicing) and introns (the parts of the DNA sequence that are spliced
|	out). 
|
|	PURPOSE
|	This problem consists of two subtasks: recognizing exon/intron
|	boundaries (referred to as EI sites), and recognizing intron/exon boundaries
|	(IE sites). (In the biological community, IE borders are referred to
|	as ``acceptors'' while EI borders are referred to as ``donors''.)
| 
|2. USE IN STATLOG
|
|	2.1- Testing Mode		
|		Train & Test
|
|	2.2- Special Preprocessing	
|		Yes
|
|	2.3- Test Results
|		
|		Algorithm	Success Rate
|		=========	===========
|		Radial		95.900
|		Dipol92		95.200
|		Alloc80		94.300
|		QuaDisc		94.100
|		Discrim		94.100
|		LogDisc		93.900
|		Bayes		93.200
|		Castle		92.800
|		IndCart		92.700
|		C4.5		92.400
|		Cart		91.500
|		BackProp	91.200
|		BayTree		90.500
|		Cn2		90.500
|		Ac2		90.000
|		NewId		90.000
|		Cal5		86.900
|		Itrule		86.500
|		Smart		85.900
|		KNN		84.500
|		Kohonen		66.100
|		Default		52.000
|		LVQ		0.000
|		Cascade		0.000
|
|3. SOURCES and PAST USAGE
|   3.1 SOURCES
|   	(a) Creators: 
|       		- all examples taken from Genbank 64.1 (ftp site: genbank.bio.net)
|       		- categories "ei" and "ie" include every "split-gene" 
|         	for primates in Genbank 64.1
|       		- non-splice examples taken from sequences known not to include
|         	a splicing site 
|   	(b) Donor: G. Towell, M. Noordewier, and J. Shavlik, 
|              {towell,shavlik}@cs.wisc.edu, noordewi@cs.rutgers.edu
|   	(c) Date received: 1/1/92
|
|	The StatLog DNA dataset is a processed version of the Irvine 
|	database described below.  The main difference is that the 
|	symbolic variables representing the nucleotides (only A,G,T,C) 
|	were replaced by 3 binary indicator variables.  Thus the original 
|	60 symbolic attributes were changed into 180 binary attributes.  
|	The names of the examples were removed.  The examples with 
|	ambiguities were removed (there were very few of them: 4).   
|	The StatLog version of this dataset was produced by Ross King
|	at Strathclyde University.   For original details see the Irvine 
|	database documentation.
|
|	The nucleotides A,C,G,T were given indicator values as follows:
|
|		A -> 1 0 0
|		C -> 0 1 0
|		G -> 0 0 1
|		T -> 0 0 0
|
|	The class values are:
|		ei -> 1
|		ie -> 2
|		n  -> 3
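|
|	As an illustration only (not part of the StatLog distribution), the
|	coding above can be sketched in Python.  The names NUCLEOTIDE_CODES,
|	CLASS_CODES and encode_window are hypothetical, and the sketch assumes
|	that the three indicators of each nucleotide are stored next to each
|	other, in sequence order.
|
```python
# Indicator codes copied from the table above; T is the all-zero case.
NUCLEOTIDE_CODES = {
    "A": (1, 0, 0),
    "C": (0, 1, 0),
    "G": (0, 0, 1),
    "T": (0, 0, 0),
}
# Class coding copied from the list above.
CLASS_CODES = {"ei": 1, "ie": 2, "n": 3}


def encode_window(sequence):
    """Turn a string of A/C/G/T into a flat list of 0/1 indicators.

    Assumption: the three indicators of a nucleotide are adjacent, so a
    60-nucleotide window yields 180 values in sequence order.
    """
    bits = []
    for base in sequence.upper():
        if base not in NUCLEOTIDE_CODES:
            # Ambiguous symbols have no code in this scheme.
            raise ValueError(f"unsupported nucleotide symbol: {base!r}")
        bits.extend(NUCLEOTIDE_CODES[base])
    return bits
```
|
|	For example, encode_window("ACGT") gives the twelve values
|	1 0 0  0 1 0  0 0 1  0 0 0, and a 60-nucleotide window gives 180 values.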
|   3.2 PAST USAGE
|
|	(a) machine learning:
|	    -- M. O. Noordewier, G. G. Towell, and J. W. Shavlik, 1991;
|	       "Training Knowledge-Based Neural Networks to Recognize Genes in
|	       DNA Sequences", In Advances in Neural Information Processing
|	       Systems, volume 3, Morgan Kaufmann.
|
|	    -- G. G. Towell, J. W. Shavlik, and M. W. Craven, 1991;
|	       "Constructive Induction in Knowledge-Based Neural Networks",
|	       In Proceedings of the Eighth International Machine Learning
|	       Workshop, Morgan Kaufmann.
|
|	    -- G. G. Towell, 1991;
|	       "Symbolic Knowledge and Neural Networks: Insertion, Refinement, and
|	       Extraction", PhD Thesis, University of Wisconsin - Madison.
|
|	    -- G. G. Towell and J. W. Shavlik, 1992;
|	       "Interpretation of Artificial Neural Networks: Mapping
|	       Knowledge-based Neural Networks into Rules", In Advances in Neural
|	       Information Processing Systems, volume 4, Morgan Kaufmann.
|
|   	(b) attributes predicted: given a position in the middle of a window
|       		of 60 DNA sequence elements (called "nucleotides" or "base-pairs"),
|       		decide whether this is
|		a) an "intron -> exon" boundary (ie) [These are sometimes called "acceptors"]
|		b) an "exon -> intron" boundary (ei) [These are sometimes called "donors"]
|		c) neither                         (n)
|   	(c) Results of the study indicated that machine learning techniques
|       		(neural networks, nearest neighbor, the contributors' KBANN system)
|       		performed as well as or better than classification based on
|       		canonical pattern matching (the method used in the biological
|       		literature).
|
|	HISTORY
|	This dataset has been developed to help evaluate a "hybrid" learning
|   	algorithm (KBANN) that uses examples to inductively refine preexisting
|   	knowledge.  Using a "ten-fold cross-validation" methodology on 1000
|   	examples randomly selected from the complete set of 3190, the following 
|   	error rates were produced by various ML algorithms (all experiments
|   	run at the Univ of Wisconsin, sometimes with local implementations
|   	of published algorithms). 
|
|		System            Neither      EI      IE
|		--------------    -------   -----   -----
|		KBANN                4.62    7.56    8.47
|		BACKPROP             5.29    5.74   10.75
|		PEBLS                6.86    8.18    7.55
|		PERCEPTRON           3.99   16.32   17.41
|		ID3                  8.84   10.58   13.99
|		COBWEB              11.80   15.04    9.46
|		Near. Neighbor      31.11   11.65    9.09
|	     	
|   	Type of domain: non-numeric, nominal (one of A, G, T, C)
|
|*************************************************************
|
|4. DATASET DESCRIPTION
|	NUMBER OF EXAMPLES: 
|		3186
|
|		Train	2000
|		Test	1186
|
|	NUMBER OF CLASSES: 
|		3 (one of 1,2,3)
|
|		Distribution of classes
|			Class	Train		Test
|			------------------------------------	
|			1	 464 (23.20%)	303 (25.55%)
|		 	2	 485 (24.25%)	280 (23.61%)
|		 	3	1051 (52.55%)	603 (50.84%)
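|
|	The percentages in the table follow directly from the counts.  A small
|	Python check, given only as a sketch (TRAIN_COUNTS, TEST_COUNTS and
|	class_percentages are hypothetical names):
|
```python
# Per-class example counts copied from the table above.
TRAIN_COUNTS = {1: 464, 2: 485, 3: 1051}
TEST_COUNTS = {1: 303, 2: 280, 3: 603}


def class_percentages(counts):
    """Return each class's share of the examples, in percent (2 decimals)."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 2) for label, n in counts.items()}


# The two splits together should account for all 3186 examples.
TOTAL_EXAMPLES = sum(TRAIN_COUNTS.values()) + sum(TEST_COUNTS.values())
```
|
|	Here class_percentages(TRAIN_COUNTS) reproduces the Train column and
|	class_percentages(TEST_COUNTS) the Test column.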
|	
|	NUMBER OF ATTRIBUTES:
|		180 binary indicator variables
|
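|	Hint.   Much better performance is generally observed if only the
|		attributes closest to the junction are used.
|		In the StatLog version, this means using
|		attributes A61 to A120 only.  Note that the declarations
|		in this file are numbered from zero (A0 to A179), so these
|		presumably correspond to A60 to A119 here.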
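|
|	A sketch of this selection in Python, assuming that StatLog attribute
|	numbers are 1-based (A1 to A180), that each example is held as a list
|	of 180 values in attribute order, and that each nucleotide occupies
|	three consecutive attributes.  junction_attributes and
|	nucleotide_position are hypothetical helpers.
|
```python
# 1-based StatLog attribute numbers named in the hint.
FIRST_ATTR, LAST_ATTR = 61, 120


def junction_attributes(row):
    """Return the 60 values for attributes A61..A120 of a 180-value example."""
    if len(row) != 180:
        raise ValueError("expected 180 indicator values")
    # 1-based attribute k sits at 0-based list index k - 1.
    return row[FIRST_ATTR - 1:LAST_ATTR]


def nucleotide_position(attr_number):
    """Map a 1-based attribute number to its 1-based nucleotide position,
    assuming three consecutive indicators per nucleotide."""
    return (attr_number - 1) // 3 + 1
```
|
|	Under these assumptions the selected attributes cover nucleotide
|	positions 21 to 40 of the 60-element window.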
|
|
|CONTACTS
|	statlog-adm@ncc.up.pt
|	bob@stams.strathclyde.ac.uk
|	
|
|================================================================================
|; little Emacs Lisp function to generate the attribute declarations below
|; (A0 to A179, each with the two values 0 and 1):
|(defun atts ()
|  (dotimes (i 180)
|    (insert (format "A%d: 0,1.\n" i))))
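|
|; The declarations that follow use the C4.5 .names layout: text after "|"
|; is a comment, the first entry lists the classes, and each later entry
|; has the form "name: values.".  A simplified reader, sketched in Python;
|; parse_names is a hypothetical helper that handles only one declaration
|; per line and ignores escaped characters.
|
```python
def parse_names(text):
    """Parse a simple .names file into (classes, attributes)."""
    classes = None
    attributes = {}
    for raw in text.splitlines():
        # Drop the comment part (everything from the first "|" onward).
        line = raw.split("|", 1)[0].strip()
        if not line:
            continue
        # Each declaration ends with a period.
        line = line.rstrip(".").strip()
        if classes is None:
            # The first declaration is the comma-separated class list.
            classes = [c.strip() for c in line.split(",")]
        else:
            name, values = line.split(":", 1)
            attributes[name.strip()] = [v.strip() for v in values.split(",")]
    return classes, attributes
```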

1,2,3. | classes
A0: 0,1.
A1: 0,1.
A2: 0,1.
A3: 0,1.
A4: 0,1.
A5: 0,1.
A6: 0,1.
A7: 0,1.
A8: 0,1.
A9: 0,1.
A10: 0,1.
A11: 0,1.
A12: 0,1.
A13: 0,1.
A14: 0,1.
A15: 0,1.
A16: 0,1.
A17: 0,1.
A18: 0,1.
A19: 0,1.
A20: 0,1.
A21: 0,1.
A22: 0,1.
A23: 0,1.
A24: 0,1.
A25: 0,1.
A26: 0,1.
A27: 0,1.
A28: 0,1.
A29: 0,1.
A30: 0,1.
A31: 0,1.
A32: 0,1.
A33: 0,1.
A34: 0,1.
A35: 0,1.
A36: 0,1.
A37: 0,1.
A38: 0,1.
A39: 0,1.
A40: 0,1.
A41: 0,1.
A42: 0,1.
A43: 0,1.
A44: 0,1.
A45: 0,1.
A46: 0,1.
A47: 0,1.
A48: 0,1.
A49: 0,1.
A50: 0,1.
A51: 0,1.
A52: 0,1.
A53: 0,1.
A54: 0,1.
A55: 0,1.
A56: 0,1.
A57: 0,1.
A58: 0,1.
A59: 0,1.
A60: 0,1.
A61: 0,1.
A62: 0,1.
A63: 0,1.
A64: 0,1.
A65: 0,1.
A66: 0,1.
A67: 0,1.
A68: 0,1.
A69: 0,1.
A70: 0,1.
A71: 0,1.
A72: 0,1.
A73: 0,1.
A74: 0,1.
A75: 0,1.
A76: 0,1.
A77: 0,1.
A78: 0,1.
A79: 0,1.
A80: 0,1.
A81: 0,1.
A82: 0,1.
A83: 0,1.
A84: 0,1.
A85: 0,1.
A86: 0,1.
A87: 0,1.
A88: 0,1.
A89: 0,1.
A90: 0,1.
A91: 0,1.
A92: 0,1.
A93: 0,1.
A94: 0,1.
A95: 0,1.
A96: 0,1.
A97: 0,1.
A98: 0,1.
A99: 0,1.
A100: 0,1.
A101: 0,1.
A102: 0,1.
A103: 0,1.
A104: 0,1.
A105: 0,1.
A106: 0,1.
A107: 0,1.
A108: 0,1.
A109: 0,1.
A110: 0,1.
A111: 0,1.
A112: 0,1.
A113: 0,1.
A114: 0,1.
A115: 0,1.
A116: 0,1.
A117: 0,1.
A118: 0,1.
A119: 0,1.
A120: 0,1.
A121: 0,1.
A122: 0,1.
A123: 0,1.
A124: 0,1.
A125: 0,1.
A126: 0,1.
A127: 0,1.
A128: 0,1.
A129: 0,1.
A130: 0,1.
A131: 0,1.
A132: 0,1.
A133: 0,1.
A134: 0,1.
A135: 0,1.
A136: 0,1.
A137: 0,1.
A138: 0,1.
A139: 0,1.
A140: 0,1.
A141: 0,1.
A142: 0,1.
A143: 0,1.
A144: 0,1.
A145: 0,1.
A146: 0,1.
A147: 0,1.
A148: 0,1.
A149: 0,1.
A150: 0,1.
A151: 0,1.
A152: 0,1.
A153: 0,1.
A154: 0,1.
A155: 0,1.
A156: 0,1.
A157: 0,1.
A158: 0,1.
A159: 0,1.
A160: 0,1.
A161: 0,1.
A162: 0,1.
A163: 0,1.
A164: 0,1.
A165: 0,1.
A166: 0,1.
A167: 0,1.
A168: 0,1.
A169: 0,1.
A170: 0,1.
A171: 0,1.
A172: 0,1.
A173: 0,1.
A174: 0,1.
A175: 0,1.
A176: 0,1.
A177: 0,1.
A178: 0,1.
A179: 0,1.