+++++++++++++++++++++++++++++++++++++++++++++++++++++ + + + T2 - Computing optimal 2-level decision trees + + + + by Peter Auer, pauer@igi.tu-graz.ac.at + + + ++++++++++++++++++++++++++++++++++++++++++++++++++++++ This is a short description of the T2 program discussed in P. Auer, R.C. Holte, and W. Maass. Theory and applications of agnostic PAC-learning with small decision trees. In Proc. 7th Int. Machine Learning Conf., Tahoe City (USA), 1995. Please see the paper for a description of the algorithm and a discussion of the results. A copy of the paper comes with this package. (There was a typo in the proceedings version of the paper in Table 2: The Sky2 value for HE is 89.0% instead of 91.0%.) Overview -------- T2 calculates optimal decision trees up to depth 2. T2 accepts exactly the same input as C4.5, consisting of a name-file, a data-file, and an optional test-file. The output of T2 is a decision tree similar to the decision trees of C4.5, but there are some differences. T2 uses two kinds of decision nodes: (1) discrete splits on an discrete attribute where the node has as many branches as there are possible attribute values, and (2) interval splits of continuous attributes. A node which performs an interval split divides the real line into intervals and has as many branches as there are intervals. The number of intervals is restricted to be (a) at most MAXINTERVALS if all the branches of the decision node lead to leaves, and to be (b) at most 2 otherwise. MAXINTERVALS can be set by the user. The attribute value ``unknown'' is treated as a special attribute value. Each decision node (discrete or continuous) has an additional branch which takes care of unknown attribute values. T2 builds the decision tree satisfying the above constraints and minimizing the number of misclassifications of cases in the data-file. Installation ------------ You should have got a directory T2-dir with the following files: Makefile README T2-paper.ps buildtree.c defns.i extern.i funcs.i getdata.c getnames.c global.i header.i init.c interface-types.i interface.i lists.c main.c misc.c output.c splits.c types.i Just say ``make'' and you get the program T2. You will need the GNU CC compiler gcc to do so. If you don't have gcc you might edit the Makefile and replace gcc by any ANSI C compiler. This should work. Running T2 ---------- To run the program type ``T2 ''. Options of T2: -f FILE use FILE.names, FILE.data, FILE.test default FILE='DF' -u evaluate on test file default don't -0 -1 -2 compute 0,1,2-level tree default -2 -i INT use splits into INT intervals at the bottom nodes default INT=Number of Classes + 1 -h gives you this list of options Output: The optimal decision tree and the evaluation on the data-file (and the test-file). Time and Space requirements: Let m ... # cases C ... # classes A ... # attributes I ... # intervals (=INT) Then the computation time scales with (m log m) C^4 I^2 A^2 and the allocated memory scales with m C^2 I^2 Using T2 as a subroutine ------------------------ For those who want to use T2 as a subroutine in their programs I suggest to look at the files main.c interface-types.i interface.i These files contain the declarations and their usage which are (hopefully) easy to incorporate into other programs. Beyond that the code is rather messy but feel free to dig into it. Peter Auer, August 1995