MLC++ Datafiles
---------------


Most databases are from Irvine and Quinlan.  Irvine datafiles have
  been converted to MLC++/C4.5 format.  For datasets that did not
  have a prespecified train/test set, we generated one by holding
  out 1/3 of the instances as the test set and using the remaining
  2/3 as the training set.
  A file with the ".all" extension contains all instances.
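
The 1/3 - 2/3 split described above can be sketched as follows.  This is
  only an illustration, not the script that produced these files: it
  assumes one instance per line, a random shuffle, and a fixed seed, and
  the names split_instances and write_split are hypothetical.

```python
import random
from pathlib import Path


def split_instances(lines, test_fraction=1 / 3, seed=0):
    """Shuffle the instances and hold out test_fraction of them for testing.

    Returns (train, test).  Blank lines are ignored.
    """
    instances = [ln for ln in lines if ln.strip()]
    rng = random.Random(seed)          # fixed seed: assumption, for repeatability
    rng.shuffle(instances)
    n_test = round(len(instances) * test_fraction)
    return instances[n_test:], instances[:n_test]


def write_split(stem):
    """Read <stem>.all and write <stem>.data (train) and <stem>.test (test)."""
    lines = Path(stem + ".all").read_text().splitlines()
    train, test = split_instances(lines)
    Path(stem + ".data").write_text("\n".join(train) + "\n")
    Path(stem + ".test").write_text("\n".join(test) + "\n")


# Nine toy instances: six go to training, three to testing.
train, test = split_instances(["i%d" % n for n in range(9)])
print(len(train), len(test))  # prints "6 3"
```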

The file "datasets.txt" lists the suggested extension to use for each
  dataset.  ".all" means cross validation (CV) is best done on that
  datafile; no extension means a train/test run (.data, .test) is best.
There are two reasons for doing a train/test as opposed to cross validation:
  1. In artificial domains such as the monk problems there is
     a prespecified training set.  Giving one third of the full
     space would be too much.
  2. Datasets with large amounts of data, such as dna and letter,
     give enough confidence in the result of a single train/test
     run that CV would be too expensive and not worthwhile.

*. The "-full" versions have meaningful attribute values instead of 1,2,3
   (e.g. the monk problems).

*. The "-local" versions have discrete attributes encoded
   using indicator variables (local encoding).  

*. vote.{data,test} is the split Quinlan made (same for vote1).
*. vote-irvine.{data,test} is a random split with 1/3 held out for testing.

*. DNA from Statlog is marked "continuous" but it has only two values, 0/1.
   dnaD is the discrete version (same data/test files); its
   names file declares the values 0,1 instead of "continuous".
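
The local encoding used by the "-local" versions above replaces each
  discrete attribute with one 0/1 indicator per possible value.  A
  minimal sketch, assuming one indicator for every value (the exact
  layout used to generate the files is not stated here); the function
  names and the toy attribute domains are hypothetical.

```python
def local_encode(value, values):
    """Return one 0/1 indicator per possible value of a discrete attribute."""
    if value not in values:
        raise ValueError("unknown attribute value: %r" % (value,))
    return [1 if value == v else 0 for v in values]


def encode_instance(instance, domains):
    """Encode every attribute of an instance and concatenate the indicators."""
    encoded = []
    for value, values in zip(instance, domains):
        encoded.extend(local_encode(value, values))
    return encoded


# Two toy attributes: a three-valued one and a two-valued one.
domains = [["round", "square", "octagon"], ["yes", "no"]]
print(encode_instance(["square", "no"], domains))  # prints [0, 1, 0, 0, 1]
```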


To avoid RSI problems, the following renames were done:
  1. agaricus-lepiota-full.data has been renamed "mushroom"
     with attribute values converted to meaningful names.
  2. pima-indians-diabetes has been renamed "pima"
  3. breast-cancer-wisconsin has been renamed "breast"
     with attribute values converted to meaningful names.
     Note that breast and breast-cancer are two different databases!

--

   Ronny Kohavi (mlc@CS.Stanford.EDU,
                 http://robotics.stanford.edu:/users/ronnyk/mlc.html)