MLC++ Datafiles
---------------

Most databases are from Irvine and Quinlan. Irvine datafiles have
been converted to MLC++/C4.5 format.

For datasets that did not have a prespecified train/test set, we
generated one by splitting off 1/3 of the instances for the test set
and keeping 2/3 for the training set. A file with the ".all"
extension contains all instances.

The file "datasets.txt" lists the suggested extension to use for each
dataset. ".all" means CV is best done on that datafile; no extension
means a train/test run (.data, .test) is best done.

There are two reasons for doing a train/test run as opposed to cross
validation:
1. In artificial domains such as the monk problems there is a
   prespecified training set. Giving one third of the full space
   would be too much.
2. Datasets with large amounts of data, such as dna and letter, give
   enough confidence in the result that CV would be too expensive
   and not worthwhile.

*. The "-full" versions have meaningful attribute values instead of
   1,2,3 (e.g. the monk problems).
*. The "-local" versions have discrete attributes encoded using
   indicator variables (local encoding).
*. vote.{data,test} is the split Quinlan made (same for vote1).
*. vote-irvine.{data,test} is a random 1/3 split for testing.
*. DNA from statlog is marked "continuous" but it has only two
   values, 0/1. dnaD is the discrete version (same data/test files),
   but its names file says 0,1.

To avoid RSI problems, the following renames were done:
1. agaricus-lepiota-full.data has been renamed "mushroom", with
   attribute values converted to meaningful names.
2. pima-indians-diabetes has been renamed "pima".
3. breast-cancer-wisconsin has been renamed "breast", with attribute
   values converted to meaningful names.
   Note that breast and breast-cancer are two different databases!

-- Ronny Kohavi (mlc@CS.Stanford.EDU,
   http://robotics.stanford.edu:/users/ronnyk/mlc.html)
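
The 1/3 test, 2/3 training split described above can be sketched in Python as follows. This is only an illustration, assuming a random shuffle with a fixed seed and one instance per line in the ".all" file. The names split_instances and write_split are hypothetical and are not part of MLC++, and the procedure actually used to generate the distributed files is not documented here.

```python
import random


def split_instances(instances, test_fraction=1 / 3, seed=0):
    """Return (train, test), with about test_fraction of the instances in test.

    Assumption: a seeded random shuffle; the original split procedure
    is not specified in this file.
    """
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_split(stem, seed=0):
    """Read stem.all and write stem.data (training) and stem.test (testing).

    Hypothetical helper: assumes one instance per line and skips blank lines.
    """
    with open(stem + ".all") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    train, test = split_instances(lines, seed=seed)
    with open(stem + ".data", "w") as f:
        f.writelines(line + "\n" for line in train)
    with open(stem + ".test", "w") as f:
        f.writelines(line + "\n" for line in test)
```

For example, write_split("pima") would read "pima.all" and produce "pima.data" and "pima.test" in the current directory, assuming "pima.all" exists there.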