file: mktree.readme -- Documentation on the main tree-building
                       program of OC1.

 ****************************************************************
 * Copyright (C) 1993 by:                                       *
 *   Sreerama K. Murthy (murthy@cs.jhu.edu)                     *
 *   Steven Salzberg (salzberg@cs.jhu.edu)                      *
 *   Simon Kasif (kasif@cs.jhu.edu)                             *
 ****************************************************************

Welcome to OC1!

Mktree is the main tree-building program of OC1. This program
builds oblique and axis parallel decision trees from examples, and
it can also run several kinds of experiments to estimate the
accuracy of these trees. With the proper parameters, it can divide
your data randomly into two partitions, and use one for training
(building the tree) and the other for testing (estimating the
tree's accuracy as a classifier). It can also run k-fold cross
validation experiments.

OC1 has many other options available, most of which can be
specified via simple command line options to mktree. All of these
options are described below. The only option that requires you to
change a file is the choice of impurity measure. This is the
measure that OC1 uses to estimate the "goodness" of a test at each
node in a tree. Six impurity measures are included in this
package, but you can write your own as well. See the documentation
below for more details.

"mktree" allows you to do the following kinds of things:

1. Build a decision tree from a randomly chosen portion of a
   labelled dataset, prune it, and use it to classify the rest of
   the dataset.

2. Build a decision tree from one dataset, and estimate the
   classification accuracy of the tree on another labelled
   dataset.

3. Perform a k-fold cross validation experiment. The data will be
   divided randomly into k partitions. For each partition p,
   mktree builds a tree on all the examples not in p, and tests
   the tree on the elements of p. It produces a report giving the
   accuracy on the entire dataset, since every example is tested
   exactly once.

4. Classify (i.e., label) the objects of a dataset with a
   pre-computed decision tree.

mktree assumes the following about a dataset:

1. All attributes are real-valued.
2. There are no missing attribute values.

"mktree" has a lot of options, but it is possible to use the
program without remembering all of them. All options have (what we
think are) reasonable default values. The options are listed
below, with explanations where necessary. You can read the file
sample_session to see some examples of how to use the mktree
command.

-t : Data file used for training.
     If no file name for test data is specified (with the -T
     option), and if the number of training examples given with
     the -n option is less than the number of examples in this
     file, then test data is also read from this file.
     No default.

-n : Number of training examples.
     If this number is less than the number of data points in the
     file specified with the -t option, then mktree randomly
     chooses n data points for the training set and uses the rest
     of the file as the test set. Overridden by the -T option.
     Default = all examples in the file given with the -t option.

-T : Data file used for testing.
     Overrides the -n option. Is overridden by the -V option.
     No default.
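For example (the file names here are hypothetical; see the file
sample_session for real transcripts):

     mktree -t mydata -n 100

builds a tree on 100 randomly chosen examples from mydata and
tests it on the remaining examples, while

     mktree -t train.dat -T test.dat

builds a tree on all of train.dat and tests it on test.dat.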
-V : Number of folds for cross validation.
     -1 = leave-one-out cross validation; 0 = no cross validation.
     If a number k is specified, the data are divided randomly
     into k partitions. If k does not evenly divide the number of
     examples, the extra examples are put in the last partition.
     Each partition is used once as a test set, and mktree builds
     a tree from the remaining examples. "Leave-one-out" is
     equivalent to setting k = total number of examples.
     Default = 0.

-d : Number of dimensions (attributes).
     Overridden if a decision tree is provided with the -D option
     AND no training set is specified with the -t option; in that
     case, the number of dimensions is read from the decision tree
     file. Usually the default value is used.
     Default: the number of white-space separated real numbers in
     line 1 of the training data file, minus 1. (The last field is
     assumed to be the category of the example.)

-c : Number of categories (classes).
     Overridden if a decision tree is provided with the -D option
     AND no training set is specified with the -t option; in that
     case, the number of categories is read from the decision tree
     file. Usually the default value is used.
     Default = the number of different class numbers in the data
     file.

-a : Consider only axis parallel splits at each node of the
     decision tree. With this option, OC1 builds axis parallel
     trees, and thus becomes equivalent to other methods such as
     C4.5 and CART, given the appropriate impurity measures.
     Overrides the -i, -b, -r and -m options.
     Default = off.

-b : Perturb the coefficients in "best first" order. (This option
     takes no argument.) At each node of the tree, OC1 considers
     how to perturb (adjust) the coefficients of the hyperplane it
     is considering at any given time. For each coefficient, n
     different values (one for each training example) are
     considered, and the best one is saved. This option tells OC1
     to perturb first the coefficient that gives the greatest
     improvement in the impurity measure.
     Default = sequential, meaning that OC1 cycles through all the
     coefficients of a hyperplane in order, perturbing each one to
     the best value it can find for that coefficient.

-r : Perturb the coefficients in random order. The argument gives
     the number of random coefficient perturbations: OC1 picks a
     random coefficient, perturbs it to its best value, and
     repeats this as many times as the argument specifies. In our
     experiments this option has not improved OC1 in any case, but
     we retain it as an experimental alternative.
     Default number of random perturbations of the coefficients =
     (10 * number of attributes).

-v : Verbose output.
     Tells mktree to print lots of information about what it is
     doing.
     Default = off.

-i : Number of iterations.
     "i=20" means that, at every node of the decision tree, we
     start with 20 different random hyperplanes, perturb each one
     of them to the best position possible, and choose the best of
     the resulting 20 locations. Choosing a larger value of i will
     slow down the program, but will often result in better trees,
     since the program gets more chances at each node to find a
     good hyperplane.
     Default = 20.

-s : Integer seed for the random number generator.
     Default: system default for the srand48() library call.

-m : Maximum number of random jumps.
     When OC1 cannot improve any of the coefficients of a
     hyperplane deterministically, it is stuck in a local minimum.
     It then attempts to jump out of this minimum by choosing a
     random direction and sliding the hyperplane in that
     direction. There are only n distinct places to slide it for n
     points, so we check all of them to see if any improve things.
     "m=5" means that we try at most 5 times to jump in a random
     direction, and we return to the deterministic perturbation
     algorithm as soon as we find an improvement in the impurity
     measure. (A sketch of this search strategy, in miniature,
     follows this option.)
     Default = 5.
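The interplay of the -i, -m, -b and -r options is easier to see in
miniature. The toy C program below is NOT OC1 code; it merely
applies the same search strategy (deterministic coefficient-by-
coefficient perturbation, plus a bounded number of random jumps
out of local minima, repeated from several random starting
hyperplanes) to an artificial "impurity" function. All names and
numeric choices in it are illustrative.

     #include <math.h>
     #include <stdio.h>
     #include <stdlib.h>

     #define DIMS      3    /* number of coefficients */
     #define RESTARTS  20   /* analogue of the -i option */
     #define MAX_JUMPS 5    /* analogue of the -m option */

     /* Artificial stand-in for the impurity of a hyperplane with
        coefficients c: a bumpy function with many local minima. */
     static double impurity(const double c[DIMS])
     {
         double s = 0.0;
         int i;
         for (i = 0; i < DIMS; i++)
             s += (c[i] - 1.0) * (c[i] - 1.0) + 0.5 * cos(5.0 * c[i]);
         return s;
     }

     /* Try several values for coefficient i and keep the best one.
        Returns 1 if the impurity improved. (OC1 computes the best
        value of a coefficient exactly; sampling is a stand-in.) */
     static int perturb_coeff(double c[DIMS], int i, double *imp)
     {
         double old = c[i], best_v = c[i], best_imp = *imp;
         int k;
         for (k = 0; k < 20; k++) {
             c[i] = old + drand48() - 0.5;
             if (impurity(c) < best_imp) {
                 best_imp = impurity(c);
                 best_v = c[i];
             }
         }
         c[i] = best_v;
         if (best_imp < *imp) {
             *imp = best_imp;
             return 1;
         }
         return 0;
     }

     int main(void)
     {
         double best[DIMS], best_imp = 1e30;
         int r, i, jumps;

         srand48(12345);                  /* analogue of -s */
         for (r = 0; r < RESTARTS; r++) { /* random restarts: -i */
             double c[DIMS], imp;
             for (i = 0; i < DIMS; i++)
                 c[i] = drand48() * 4.0 - 2.0;
             imp = impurity(c);
             jumps = 0;
             while (jumps < MAX_JUMPS) {
                 int improved = 0;
                 /* deterministic phase: sweep the coefficients in
                    order (-b and -r change this order) */
                 for (i = 0; i < DIMS; i++)
                     improved |= perturb_coeff(c, i, &imp);
                 if (improved)
                     continue;
                 /* stuck in a local minimum: one random jump;
                    -m bounds the number of jump attempts */
                 {
                     double t[DIMS];
                     for (i = 0; i < DIMS; i++)
                         t[i] = c[i] + drand48() - 0.5;
                     jumps++;
                     if (impurity(t) < imp) {
                         for (i = 0; i < DIMS; i++)
                             c[i] = t[i];
                         imp = impurity(c);
                     }
                 }
             }
             if (imp < best_imp) {
                 best_imp = imp;
                 for (i = 0; i < DIMS; i++)
                     best[i] = c[i];
             }
         }
         printf("best impurity %.4f at (%.3f, %.3f, %.3f)\n",
                best_imp, best[0], best[1], best[2]);
         return 0;
     }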
"m=5" means that we try at most 5 times to jump in a random direction, and we return to the deterministic perturbation algorithm as soon as we find an improvement in the impurity measure. -p : pruning portion Default = 0.1 OC1 always prunes decision trees by default, to avoid the problem of overfitting. Currently, the only pruning method implemented is error complexity pruning using separate pruning dataset. This option specifies to use a randomly chosen set of (pruning_portion * training set size) points from the training set exclusively for pruning. eg: pruning_portion=0.3 means that 7/10s of the training data is used in actual building of the decision tree, and 3/10s in pruning. If pruning_portion = 0, no pruning is done. -D : Decision tree file. If a training set is specified with the -t option, then OC1 builds a tree and writes it to this file. If no training set is specified, then OC1 assumes you want to use a pre-built tree to classify some data, and it reads the decision tree from this file. In case the tree is to be output, Default = .dt -M : file to list misclassified examples in the test data Default = None Some users want to know which examples OC1 misclassified in an experiment. If you specify this option, OC1 will write all misclassified examples to the file you name. -l : Log file. Default = oc1.log For every run of "mktree", all the parameter values are listed in this file. It is useful mainly when default values of parameters are used. In addition to the parameters that can be specified using program options, one more very important parameter is the "impurity measure" or metric that is used to determine the "goodness" of a hyperplane location. The choice of this parameter often makes a big difference in the size and quality of the resulting tree. "mktree" can use any impurity measure you wish to define, and four different ones have been pre-defined for you. To check which impurity measure is currently being used, check the value of the constant IMPURITY in the file "oc1.h". To change the measure being used, simply change this constant and re-compile. Implementations of six impurity measures: 1. max minority, 2. sum minority, 3. sum of impurities, and 4. information gain 5. gini_index 6. twoing rule are included in the system (in the file impurity_measures.c). (For a description of some of these impurity measures, refer to oc1.ps, the postscript version of our AAAI-93 paper). To use a customized impurity measure "mymeasure", do the following. 1. Add a function "mymeasure()" to the file "impurity_measures.c", that takes no arguments and returns a float. 2. This function can use two integer arrays "left_count" and "right_count", each of length "no_of_categories", specifying how many points of each category lie on the left and right sides of the hyperplane. Note: These are declared as external variables in compute_impurity(). Just use them as read-only global variables to compute the impurity. 3. "mymeasure" should explicitly return a float value, which is the impurity. OC1 tries to find a location of the hyperplane that MINIMIZES this impurity. Special care should be taken for measures like Information Gain that (in a sense) measure the "purity" of a split, and not its impurity. One way to maintain uniformity is to return the negative or inverse of the purity (see the implementation for InfoGain and Twoing). 4. Define the # constant "IMPURITY" to be "mymeasure()", in the file "oc1.h". Some useful suggestions on parameter settings : 1. 
Some useful suggestions on parameter settings:

1. Overfitting usually results when the values given with the -i
   and -m options are larger than they need to be.

2. If the classification accuracy is much higher for one of the
   categories than for the others, mktree is probably pruning the
   tree more than it should. Run the program in verbose mode to
   see the different stages of pruning.

3. The impurity measure "Summinority" almost always gives the
   smallest trees, but not necessarily the most accurate ones. So,
   if you are interested in trees that are as small as possible,
   try Summinority first. If the classification accuracies of the
   trees thus obtained are unsatisfactory, you can experiment with
   other impurity measures.

4. When inducing decision trees on reasonably large datasets, or
   when running cross validation experiments, it is better to run
   mktree in verbose mode (-v option), so that you know which
   stage of tree induction the program is in.
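For instance (the file name is hypothetical), a ten-fold cross
validation experiment with verbose progress reports can be run as:

     mktree -t mydata -V 10 -v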