file: mktree.readme -- Documentation on the main tree-building
                       program of OC1.

 ****************************************************************
 * Copyright (C) 1993 by:                                       *
 *   Sreerama K. Murthy (murthy@cs.jhu.edu)                     *
 *   Steven Salzberg (salzberg@cs.jhu.edu)                      *
 *   Simon Kasif (kasif@cs.jhu.edu)                             *
 ****************************************************************

Welcome to OC1!

Mktree is the main tree-building program of OC1. This program
builds oblique and axis parallel decision trees from examples, and
it can also run several kinds of experiments to estimate the
accuracy of these trees. With the proper parameters, it can divide
your data randomly into two partitions, and use one for training
(building the tree) and the other for testing (estimating the
tree's accuracy as a classifier). It can also run k-fold cross
validation experiments.

OC1 has many other options available, most of which can be
specified via simple command line options to mktree. All of these
options are described below. The only option that requires you to
change a file is the choice of impurity measure. This is the
measure that OC1 uses to estimate the "goodness" of a test at each
node in a tree. Six impurity measures are included in this
package, but you can write your own as well. See the documentation
below for more details.

"mktree" allows you to do the following kinds of things:

1. Build a decision tree from a randomly chosen portion of a
   labelled dataset, prune it, and use it to classify the rest of
   the dataset.

2. Build a decision tree from one dataset, and estimate the
   classification accuracy of the tree on another labelled
   dataset.

3. Perform a k-fold cross validation experiment. The data will be
   divided randomly into k partitions. For each partition p,
   mktree builds a tree on all the examples not in p, and tests
   the tree on the elements of p. It produces a report giving the
   accuracy on the entire dataset, since every example is tested
   exactly once.

4. Classify (i.e., label) the objects of a dataset with a
   pre-computed decision tree.

mktree assumes the following about a dataset:

1. All attributes are real-valued.
2. There are no missing attribute values.

"mktree" has a lot of options, but it is possible to use the
program without remembering all of them. All options have (what we
think are) reasonable default values. The options are listed
below, with explanations where necessary. You can read the file
sample_session to see some examples of how to use the mktree
command.

-t : Data file used for training.
     If no file name for test data is specified (with the -T
     option), and if the number of training examples given with
     the -n option is less than the number of examples in this
     file, then test data is also read from this file.
     No default.

-n : Number of training examples.
     If this number is less than the number of data points in the
     file specified with the -t option, then mktree randomly
     chooses n data points for the training set and uses the rest
     of the file as the test set. Overridden by the -T option.
     Default = all examples in the file given with the -t option.

-T : Data file used for testing.
     Overrides the -n option. Is overridden by the -V option.
     No default.
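For example (the file names here are hypothetical; see the file
sample_session for real transcripts):

     mktree -t mydata -n 100

builds a tree on 100 randomly chosen examples from mydata and
tests it on the remaining examples, while

     mktree -t train.dat -T test.dat

builds a tree on all of train.dat and tests it on test.dat.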
-V : Number of folds for cross validation.
     -1 = leave-one-out cross validation; 0 = no cross validation.
     If a number k is specified, the data are divided randomly
     into k partitions. If k does not evenly divide the number of
     examples, the extra examples are put in the last partition.
     Each partition is used once as a test set, and mktree builds
     a tree from the remaining examples. "Leave-one-out" is
     equivalent to setting k = total number of examples.
     Default = 0.

-d : Number of dimensions (attributes).
     Overridden if a decision tree is provided with the -D option
     AND no training set is specified with the -t option; in that
     case, the number of dimensions is read from the decision tree
     file. Usually the default value is used.
     Default: the number of white-space separated real numbers in
     line 1 of the training data file, minus 1. (The last field is
     assumed to be the category of the example.)

-c : Number of categories (classes).
     Overridden if a decision tree is provided with the -D option
     AND no training set is specified with the -t option; in that
     case, the number of categories is read from the decision tree
     file. Usually the default value is used.
     Default = the number of different class numbers in the data
     file.

-a : Consider only axis parallel splits at each node of the
     decision tree. With this option, OC1 builds axis parallel
     trees, and thus becomes equivalent to other methods such as
     C4.5 and CART, given the appropriate impurity measures.
     Overrides the -i, -b, -r and -m options.
     Default = off.

-b : Perturb the coefficients in "best first" order. (This option
     takes no argument.) At each node of the tree, OC1 considers
     how to perturb (adjust) the coefficients of the hyperplane it
     is considering at any given time. For each coefficient, n
     different values (one for each training example) are
     considered, and the best one is saved. This option tells OC1
     to perturb first the coefficient that gives the greatest
     improvement in the impurity measure.
     Default = sequential, meaning that OC1 cycles through all the
     coefficients of a hyperplane in order, perturbing each one to
     the best value it can find for that coefficient.

-r : Perturb the coefficients in random order. The argument gives
     the number of random coefficient perturbations: OC1 picks a
     random coefficient, perturbs it to its best value, and
     repeats this as many times as the argument specifies. In our
     experiments this option has not improved OC1 in any case, but
     we retain it as an experimental alternative.
     Default number of random perturbations of the coefficients =
     (10 * number of attributes).

-v : Verbose output.
     Tells mktree to print lots of information about what it is
     doing.
     Default = off.

-i : Number of iterations.
     "i=20" means that, at every node of the decision tree, we
     start with 20 different random hyperplanes, perturb each one
     of them to the best position possible, and choose the best of
     the resulting 20 locations. Choosing a larger value of i will
     slow down the program, but will often result in better trees,
     since the program gets more chances at each node to find a
     good hyperplane.
     Default = 20.

-s : Integer seed for the random number generator.
     Default: system default for the srand48() library call.

-m : Maximum number of random jumps.
     When OC1 cannot improve any of the coefficients of a
     hyperplane deterministically, it is stuck in a local minimum.
     It then attempts to jump out of this minimum by choosing a
     random direction and sliding the hyperplane in that
     direction. There are only n distinct places to slide it for n
     points, so we check all of them to see if any improve things.
     "m=5" means that we try at most 5 times to jump in a random
     direction, and we return to the deterministic perturbation
     algorithm as soon as we find an improvement in the impurity
     measure. (A sketch of this search strategy, in miniature,
     follows this option.)
     Default = 5.
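The interplay of the -i, -m, -b and -r options is easier to see in
miniature. The toy C program below is NOT OC1 code; it merely
applies the same search strategy (deterministic coefficient-by-
coefficient perturbation, plus a bounded number of random jumps
out of local minima, repeated from several random starting
hyperplanes) to an artificial "impurity" function. All names and
numeric choices in it are illustrative.

     #include <math.h>
     #include <stdio.h>
     #include <stdlib.h>

     #define DIMS      3    /* number of coefficients */
     #define RESTARTS  20   /* analogue of the -i option */
     #define MAX_JUMPS 5    /* analogue of the -m option */

     /* Artificial stand-in for the impurity of a hyperplane with
        coefficients c: a bumpy function with many local minima. */
     static double impurity(const double c[DIMS])
     {
         double s = 0.0;
         int i;
         for (i = 0; i < DIMS; i++)
             s += (c[i] - 1.0) * (c[i] - 1.0) + 0.5 * cos(5.0 * c[i]);
         return s;
     }

     /* Try several values for coefficient i and keep the best one.
        Returns 1 if the impurity improved. (OC1 computes the best
        value of a coefficient exactly; sampling is a stand-in.) */
     static int perturb_coeff(double c[DIMS], int i, double *imp)
     {
         double old = c[i], best_v = c[i], best_imp = *imp;
         int k;
         for (k = 0; k < 20; k++) {
             c[i] = old + drand48() - 0.5;
             if (impurity(c) < best_imp) {
                 best_imp = impurity(c);
                 best_v = c[i];
             }
         }
         c[i] = best_v;
         if (best_imp < *imp) {
             *imp = best_imp;
             return 1;
         }
         return 0;
     }

     int main(void)
     {
         double best[DIMS], best_imp = 1e30;
         int r, i, jumps;

         srand48(12345);                  /* analogue of -s */
         for (r = 0; r < RESTARTS; r++) { /* random restarts: -i */
             double c[DIMS], imp;
             for (i = 0; i < DIMS; i++)
                 c[i] = drand48() * 4.0 - 2.0;
             imp = impurity(c);
             jumps = 0;
             while (jumps < MAX_JUMPS) {
                 int improved = 0;
                 /* deterministic phase: sweep the coefficients in
                    order (-b and -r change this order) */
                 for (i = 0; i < DIMS; i++)
                     improved |= perturb_coeff(c, i, &imp);
                 if (improved)
                     continue;
                 /* stuck in a local minimum: one random jump;
                    -m bounds the number of jump attempts */
                 {
                     double t[DIMS];
                     for (i = 0; i < DIMS; i++)
                         t[i] = c[i] + drand48() - 0.5;
                     jumps++;
                     if (impurity(t) < imp) {
                         for (i = 0; i < DIMS; i++)
                             c[i] = t[i];
                         imp = impurity(c);
                     }
                 }
             }
             if (imp < best_imp) {
                 best_imp = imp;
                 for (i = 0; i < DIMS; i++)
                     best[i] = c[i];
             }
         }
         printf("best impurity %.4f at (%.3f, %.3f, %.3f)\n",
                best_imp, best[0], best[1], best[2]);
         return 0;
     }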
"m=5" means that we try at most 5 times to jump in a random direction, and we return to the deterministic perturbation algorithm as soon as we find an improvement in the impurity measure. -p : pruning portion Default = 0.1 OC1 always prunes decision trees by default, to avoid the problem of overfitting. Currently, the only pruning method implemented is error complexity pruning using separate pruning dataset. This option specifies to use a randomly chosen set of (pruning_portion * training set size) points from the training set exclusively for pruning. eg: pruning_portion=0.3 means that 7/10s of the training data is used in actual building of the decision tree, and 3/10s in pruning. If pruning_portion = 0, no pruning is done. -D : Decision tree file. If a training set is specified with the -t option, then OC1 builds a tree and writes it to this file. If no training set is specified, then OC1 assumes you want to use a pre-built tree to classify some data, and it reads the decision tree from this file. In case the tree is to be output, Default = .dt -M : file to list misclassified examples in the test data Default = None Some users want to know which examples OC1 misclassified in an experiment. If you specify this option, OC1 will write all misclassified examples to the file you name. -l : Log file. Default = oc1.log For every run of "mktree", all the parameter values are listed in this file. It is useful mainly when default values of parameters are used. In addition to the parameters that can be specified using program options, one more very important parameter is the "impurity measure" or metric that is used to determine the "goodness" of a hyperplane location. The choice of this parameter often makes a big difference in the size and quality of the resulting tree. "mktree" can use any impurity measure you wish to define, and four different ones have been pre-defined for you. To check which impurity measure is currently being used, check the value of the constant IMPURITY in the file "oc1.h". To change the measure being used, simply change this constant and re-compile. Implementations of six impurity measures: 1. max minority, 2. sum minority, 3. sum of impurities, and 4. information gain 5. gini_index 6. twoing rule are included in the system (in the file impurity_measures.c). (For a description of some of these impurity measures, refer to oc1.ps, the postscript version of our AAAI-93 paper). To use a customized impurity measure "mymeasure", do the following. 1. Add a function "mymeasure()" to the file "impurity_measures.c", that takes no arguments and returns a float. 2. This function can use two integer arrays "left_count" and "right_count", each of length "no_of_categories", specifying how many points of each category lie on the left and right sides of the hyperplane. Note: These are declared as external variables in compute_impurity(). Just use them as read-only global variables to compute the impurity. 3. "mymeasure" should explicitly return a float value, which is the impurity. OC1 tries to find a location of the hyperplane that MINIMIZES this impurity. Special care should be taken for measures like Information Gain that (in a sense) measure the "purity" of a split, and not its impurity. One way to maintain uniformity is to return the negative or inverse of the purity (see the implementation for InfoGain and Twoing). 4. Define the # constant "IMPURITY" to be "mymeasure()", in the file "oc1.h". Some useful suggestions on parameter settings : 1. 
Some useful suggestions on parameter settings:

1. Overfitting usually results when the values given with the -i
   and -m options are larger than they need to be.

2. If the classification accuracy is much higher for one of the
   categories than for the others, mktree is probably pruning the
   tree more than it should. Run the program in verbose mode to
   see the different stages of pruning.

3. The impurity measure "Summinority" almost always gives the
   smallest trees, but not necessarily the most accurate ones. So,
   if you are interested in trees that are as small as possible,
   try Summinority first. If the classification accuracies of the
   trees thus obtained are unsatisfactory, you can experiment with
   other impurity measures.

4. When inducing decision trees on reasonably large datasets, or
   when running cross validation experiments, it is better to run
   mktree in verbose mode (-v option), so that you know which
   stage of tree induction the program is in.
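For instance (the file name is hypothetical), a ten-fold cross
validation experiment with verbose progress reports can be run as:

     mktree -t mydata -V 10 -v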