Bag Of Words Library README
***************************

`libbow', version {No Value For "BOW_VERSION"}.

`Libbow' is a library of C code intended for writing statistical
text-processing programs.  This distribution includes the library, as
well as a text classification front-end and a document retrieval
front-end.

The library provides facilities for:
     Recursively descending directories, finding text files.
     Finding `document' boundaries when there are multiple docs per file.
     Tokenizing a text file, according to several different methods.
     Including N-grams among the tokens.
     Mapping strings to integers and back again, very efficiently.
     Building a matrix of document/token counts.
     Pruning vocabulary by occurrence counts or by information gain.
     Building and manipulating word vectors.
     Setting word vector weights according to Naive Bayes, TFIDF, and a
       simple form of Probabilistic Indexing.
     Scoring queries for retrieval or classification.
     Writing all data structures to disk in a
       machine-architecture-independent format.
     Reading the document/token matrix from disk in an efficient,
       sparse fashion.
     Performing test/train splits, and automatic classification tests.

It should compile on most UNIX systems, and on Windows NT (with a GNU
build environment).

The code conforms to the GNU coding standards.  It is released under
the GNU Library General Public License.

The library does not:
     Have parsing facilities.
     Do smoothing across N-gram models.
     Claim to be finished.
     Have good documentation.
     Claim to be bug-free.
     ...many other things.

Rainbow
=======

`Rainbow' is a standalone program that does document classification.
Here are some examples:

   * rainbow -i ./training/positive ./training/negative

     Using the text files found under the directories
     `./training/positive' and `./training/negative', tokenize, build
     word vectors, and write the resulting data structures to disk.

   * rainbow -q ./testing/254

     Tokenize the text document `./testing/254', and classify it,
     producing output like:

          /home/mccallum/training/positive 0.72
          /home/mccallum/training/negative 0.28

   * rainbow -t 5

     Perform 5 trials, each consisting of a test/train split, a
     resetting of weights according to the new split, and output of
     the classifications of the test documents.

Typing `rainbow --help' will print a list of all rainbow options.

After you have compiled `libbow' and `rainbow', you can run the shell
script `./demo/script' to see an annotated demonstration of the
classifier in action.

The web page
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes.html
has a pointer to a "Gentle Introduction to Rainbow", as well as some
sample UseNet text data.

Rainbow improvements coming soon:
     Better documentation.
     Better modularity of command-line options for changing parameters
       of weight-setting methods.
     Incremental model training.
     Better smoothing.  Good-Turing estimates, etc.

Arrow
=====

`Arrow' is a standalone program that does document retrieval.  Sorry,
there is no documentation yet.
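
Example sketch
==============

Two of the library facilities listed at the top of this file, mapping
strings to integers and back and building document/token counts, can
be pictured with the small C sketch below.  It does not use the libbow
API and is not taken from the libbow sources.  Every name in it
(`vocab_word2int', `vocab_int2word', `count_tokens', `MAX_WORDS',
`MAX_WORD_LEN') is hypothetical, and the linear-search vocabulary and
the lowercase alphabetic tokenizer are simplifying assumptions made to
keep the example short.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORDS    1024  /* hypothetical fixed vocabulary capacity */
#define MAX_WORD_LEN 64    /* longer tokens are truncated to 63 chars */

/* Hypothetical vocabulary table: the array index is the word's id. */
static char vocab[MAX_WORDS][MAX_WORD_LEN];
static int vocab_size = 0;

/* Return the id of `word', adding it if unseen; -1 if the table is full.
   Linear search keeps the sketch short; a hash table would be faster. */
int vocab_word2int(const char *word)
{
    for (int i = 0; i < vocab_size; i++)
        if (strncmp(vocab[i], word, MAX_WORD_LEN - 1) == 0)
            return i;
    if (vocab_size == MAX_WORDS)
        return -1;
    strncpy(vocab[vocab_size], word, MAX_WORD_LEN - 1);
    vocab[vocab_size][MAX_WORD_LEN - 1] = '\0';
    return vocab_size++;
}

/* Return the string for `id', or NULL if `id' is out of range. */
const char *vocab_int2word(int id)
{
    return (id >= 0 && id < vocab_size) ? vocab[id] : NULL;
}

/* Split `text' into lowercase alphabetic tokens and increment
   counts[id] for each one.  `counts' must hold MAX_WORDS ints.
   Returns the number of tokens counted. */
int count_tokens(const char *text, int *counts)
{
    char tok[MAX_WORD_LEN];
    int len = 0, total = 0;

    for (const char *p = text; ; p++) {
        if (isalpha((unsigned char)*p)) {
            if (len < MAX_WORD_LEN - 1)
                tok[len++] = (char)tolower((unsigned char)*p);
        } else {
            if (len > 0) {          /* a token just ended */
                tok[len] = '\0';
                int id = vocab_word2int(tok);
                if (id >= 0) {
                    counts[id]++;
                    total++;
                }
                len = 0;
            }
            if (*p == '\0')
                break;
        }
    }
    return total;
}
```

Calling `count_tokens' once per document, each time with its own
zeroed `counts' array, gives one dense row of a document/token count
matrix over a shared vocabulary.  A real implementation would more
likely store such rows sparsely.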