======================================================================
David W. Aha                                                  May 1993
My implementation of Steven Salzberg's NGE
======================================================================

I've tried to make it simple, incorporating both Steven's ideas and the
ideas from the Oregon State folks:

NGE(tr, te, #seeds, Feature_adjustment_rate, greedy-flag)
  Input the tr and te
  Normalize all data (simple linear normalization for now)
  Replace all unknown values with [0,1] for numeric features or the
    equivalent of "ALL" for symbolic features
  Replace all known values with [v,v] for numeric features or the
    equivalent of {v} for symbolic features
  Select #seeds from the tr; prevent them from being used during training
  Set up initial weights for seeds:
    1. Exemplar weights: 1/1 (#total_uses/#correct)
    2. Attribute weights: set them all to 1
  Train:
    For each (other) element of tr:
      1. Compute the best two matches, using the formula from Steven's paper
         (see the sketch following this section)
      2. If the best match's class matches
         Then correct-update weights for best matcher & extend it
         Else incorrect-update weights for best matcher
              If greedy flag is set
              Then If the 2nd best match's class matches
                   Then correct-update weights for 2nd best matcher & extend it
                   Else incorrect-update weights for 2nd best matcher
                        save the new instance as a new exemplar
              Else save the new instance as a new exemplar
  Test:
    Find the best match; add 1 to #correct if the classification is identical

Updating weights:
  1. Exemplar weights: increment #uses; also increment #correct if the
     classification was correct
  2. Attribute weights:
     Assuming correct classification:
       1. If they match, multiply by (1+Feature_adjustment_rate)
       2. If they mismatch, multiply by (1-Feature_adjustment_rate)
     Assuming incorrect classification: switch (1) with (2) above

Notes:
  1. Features match if symbolic or within the hyperrectangle's range.
  2. Symbolic features are handled separately from numeric ones; a
     different similarity measure (the overlap metric) is used for
     symbolic features.
  3. Extending a symbolic-valued hyperrectangle involves adding new values
     to the set of values covered.  I use a binary array of length
     MAX_NUMBER_OF_SYMBOLIC_VALUES_PER_FEATURE to encode which values are
     "in" the hyperrectangle for the given feature (i.e., only those with
     a "1" at the corresponding array index).  The possible symbolic values
     are mapped to zero-indexed integers, so they must be known at input
     time.  Missing values have a "1" in every element of the array.

I'm doing this by adapting my IBL code, making minimal changes, but the
result is a stand-alone, general system that can run either the greedy
or the non-greedy algorithm.

Ways that my algorithm differs from Steven's:
  1. Normalization is done on all instances beforehand.
  2. Everything is a hyperrectangle; this works out to be the same
     algorithm.
  3. Ties are broken in favor of "smaller" hyperrectangles, but I handle
     symbolic-valued attributes differently here: each contributes
     #included/#values to the product used to compute the area covered.
  4. I prepared my datasets differently than Steven did; this is probably
     the main factor behind the observed differences.  I'm looking at
     *many* different variants of this now.
  5. Unknown attributes take on the entire set of values for the
     attribute; that is, they match anything.
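For concreteness, here is a minimal C sketch of the numeric-feature part of
the matching step and of the attribute-weight update, assuming Salzberg's
weighted distance-to-hyperrectangle formula (distance is 0 for a feature
whose value falls inside the rectangle, otherwise the distance to the
nearest edge, scaled by the attribute weight; the whole sum is scaled by
the exemplar weight #uses/#correct).  The struct and function names
(Exemplar, nge_distance, adjust_attribute_weights) are mine, not taken
from the actual code, and symbolic features / the overlap metric are
omitted here.

#include <math.h>

#define MAX_FEATURES 64

typedef struct {
    double lower[MAX_FEATURES];   /* lower edges of the hyperrectangle (normalized) */
    double upper[MAX_FEATURES];   /* upper edges of the hyperrectangle (normalized) */
    int    uses;                  /* #total_uses (exemplar weight numerator)   */
    int    correct;               /* #correct    (exemplar weight denominator) */
} Exemplar;

/* Weighted distance from a normalized numeric instance e[] to exemplar h;
   a smaller value means a better match. */
double nge_distance(const double *e, const Exemplar *h,
                    const double *att_weight, int num_features)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < num_features; i++) {
        double d = 0.0;                     /* 0 if the value is inside the range */
        if (e[i] > h->upper[i])
            d = e[i] - h->upper[i];
        else if (e[i] < h->lower[i])
            d = h->lower[i] - e[i];
        d *= att_weight[i];                 /* per-attribute weight */
        sum += d * d;
    }
    /* Exemplar weight #uses/#correct: unreliable exemplars look farther away. */
    return ((double)h->uses / (double)h->correct) * sqrt(sum);
}

/* Attribute-weight update with Feature_adjustment_rate = adjust_rate.
   On a correct classification, matching features are multiplied by
   (1 + adjust_rate) and mismatching ones by (1 - adjust_rate); on an
   incorrect classification the two cases are swapped. */
void adjust_attribute_weights(double *att_weight, const int *feature_matched,
                              int num_features, double adjust_rate,
                              int was_correct)
{
    int i;
    for (i = 0; i < num_features; i++) {
        int reward = was_correct ? feature_matched[i] : !feature_matched[i];
        att_weight[i] *= reward ? (1.0 + adjust_rate) : (1.0 - adjust_rate);
    }
}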
Summary of results:

Breast cancer: (averaged over 10 runs per setting)
  #symbolic attributes          -far  Best mean result (#seeds)
  ----------------------------  ----  -------------------------
  2                             0.20  63.7% (150 seeds)
  1                             0.20  68.2% (65 seeds)
  1                             0.05  67.3% (150 seeds)
  0 (eliminated att #8)         0.05  73.3% (15 seeds)
  0 (eliminated att #8)         0.20  71.8% (10 seeds)
  0 (eliminated atts #2 & #8)   0.20  70.4% (15 seeds)

Voting: (averaged over 10 runs per setting)
  -far  Mean results (#seeds)
  ----  -------------------------
  0.20  63.5% (25 seeds), 92.8% (190 seeds)
        (increases steadily as #seeds is increased)

LED display: (averaged over 10 runs per setting)
  -far  Mean results (#seeds)
  ----  -------------------------
  0.20  53.3% (5 seeds), 60.1% (20 seeds)
        (seems to hover between 55% and 60%)
  0.05  52.9% (30 seeds), 61.4% (50 seeds)