Thu May 29 15:25:06 1997 Andrew McCallum * Version (BOW_MINOR_VERSION): Version 0.8. * bow/libbow.h (BOW_MINOR_VERSION): Version 0.8. * docnames.c (bow_map_filenames_from_dir): Remove local variables no longer used. Mon May 26 12:59:50 1997 Andrew McCallum * rainbow.c (main): New commented-out code for computing the number of word co-occurrences. Fri May 23 11:34:05 1997 Andrew McCallum * rainbow.c (USE_VOCAB_IN_FILE_KEY): New macro. (rainbow_options): New option "use-vocab-in-file". (rainbow_parse_opt): Handle it. (struct rainbow_arg_state): New member VOCAB_MAP. (rainbow_query): Use it to remove words from the vocabulary. (rainbow_test): Likewise. (main): Likewise. * rainbow-stats.pl (prune_from_classname): New global variable. A regular expression to be removed from the end of classnames before gathering stats on them. This allows us to gather stats on performance in the middle of class hierarchies. (read_trial): Use it. * int4str.c (bow_int4str_new_from_text_file): Return MAP instead of NULL! * barrel.c (bow_barrel_prune_words_not_in_map): Define MAX_WI and use it, so we don't ask for word indices larger than bow_num_words(). (bow_barrel_print_word_count): Also print word probability according to counts. * rainbow-h.c (main) [printing_word_counts]: Print word that is being counted. Wed May 21 15:01:51 1997 Andrew McCallum * barrel.c (bow_barrel_prune_words_not_in_map): Remove the words instead of hiding them, so that future bow_keep_top_words_by_infogain() calls won't unhide them. This version got 46% on hier/yahoo-science (dataset with a 10 document-per-class threshold). * rainbow-h.c (rainbowh_options): Added --use-vocab-in-file command-line option. (rainbowh_arg_state): Added PARENT and CI_IN_PARENT. Added HIER_LEAF. Removed printing of leaf- and intermediate-results. (hier_barrel_prob_wi_in_ci): New function. (check_prob_wi_in_ci): New function. (_hier_barrel_local_score): New function. (_hier_barrel_set_node_scores): Use it. 
(hier_barrel_print_infogain): Print FULL_NAME with interspersed spaces, so it won't get lexed by bow_int4str_new_from_text_file(). (main): Change defaults. Before populate_by_scoring=1 and hier_structure=hier_niece. Populate branches first thing, and check prob_wi_ci consistency. * naivebayes.c (bow_naivebayes_score_loo): Comment change. * int4str.c (bow_int4str_new_from_text_file): New function. * bow/libbow.h: Declare new functions. Tue May 20 16:02:24 1997 Andrew McCallum * barrel.c (bow_barrel_prune_words_not_in_map): New function. Mon May 19 09:52:09 1997 Andrew McCallum * rainbow-stats.pl (confusion): Calculate longest classname and use it to fix indentation. * wi2dvf.c (bow_wi2dvf_add_di_wv): Set SEEK_START to special flag 2. (bow_wi2dvf_add_wi_di_count_weight): Likewise. (bow_wi2dvf_hide_wi): Decrement WI2DVF->NUM_WORDS in the right place. (bow_wi2dvf_unhide_all_wi): Increment WI2DVF->NUM_WORDS. (bow_wi2dvf_write): Unhide all words first. (bow_wi2dvf_dv): Change assertion to deal with special flag 2. * rainbow.c (main): Pass new argument to bow_infogain_per_wi_print(). * rainbow-h.c: Misc changes. Print infogain during run. (hier_barrel_set_local_class_model): Add IS_ROOT argument. Unhide vocabulary after pruning by infogain, so lower levels get all words. * naivebayes.c (M_EST_M): New macro. (M_EST_P): New macro. (bow_naivebayes_score_loo): Use them to implement M-estimates, instead of old Laplace smoothing. * info_gain.c (bow_infogain_per_wi_print): Add FP argument. * bow/libbow.h: Add argument to infogain function. * barrel.c: Fix the math for assigning CDOC->PRIOR, and add assertion checks. Fri May 16 10:19:19 1997 Andrew McCallum This was state of code on Thursday night. * rainbow-h.c: Add options for changing population scheme and tree structure. Add ability to output intermediate and leaf results. * naivebayes.c (WORD_PRIOR_COUNT): New macro. Current value 1.0. (bow_naivebayes_score_loo): Use it. 
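Note on the May 19 naivebayes.c entry: a minimal sketch of an m-estimate, assuming the usual textbook form (count + m*p) / (total + m). The function name m_estimate and its parameters are hypothetical; this log does not record the values of M_EST_M and M_EST_P or the exact expression used in bow_naivebayes_score_loo(). With m equal to the vocabulary size and p equal to 1/m, the formula reduces to the Laplace (add-one) smoothing that the entry says it replaces.

```c
#include <assert.h>

/* Hypothetical helper, not libbow's API: m-estimate of P(word | class).
   count : occurrences of the word in the class
   total : total word occurrences in the class
   m     : equivalent sample size (how much weight the prior gets)
   p     : prior probability of the word */
double
m_estimate (double count, double total, double m, double p)
{
  assert (count >= 0.0 && total >= count);
  assert (m > 0.0 && p > 0.0 && p <= 1.0);
  return (count + m * p) / (total + m);
}
```

A word that never occurs in a class still gets a non-zero probability, m*p / (total + m), so a single unseen word cannot zero out the class score.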
Thu May 15 16:22:27 1997 Andrew McCallum * rainbow.c (rainbow_test): Assert that the ACTUAL_NUM_HITS returned by bow_barrel_score() is the same as the NUM_HITS_TO_RETRIEVE requested. * split.c (bow_test_split): Use rand() properly so that the number of test documents in each class is not so biased. Add special code that *ensures* that the test documents are evenly distributed across classes. * rainbow.c (rainbow_print_weight_vector): Don't use CDOC->NORMALIZER if the method is "naivebayes", because NaiveBayes doesn't use it. Previously the printed values were bogus. Wed May 14 11:02:44 1997 Andrew McCallum * rainbow-h.c: -q RAINBOWH_QUERYING now seems to work. * naivebayes.c (bow_naivebayes_score_loo): Add assertion that CDOC->PRIOR is greater than zero. This restriction should be relaxed! * array.c (bow_array_free): Decrement length after testing for non-zero-ness, not before. Without this change, empty arrays would call free() on un-malloc'ed() memory. Tue May 13 18:16:31 1997 Andrew McCallum * rainbow-h.c: Add code for doing selective population of lower branches. This population seems to be working. Querying/scoring does not yet work. * wi2dvf.c (bow_wi2dvf_hide_wi): Change assertion to "if" so that we won't crash if we try to hide words that are already hidden. * split.c (bow_tmp_word_struct2): New type. (bow_model_next_wv): New function. (bow_nontest_next_wv): New function. * rainbow.c (rainbow_options): Fix documentation for test-files. (rainbow_test): Choose vocabulary by info gain *after* the test/train split. Add temporary code to test bow_naivebayes_score_loo(). Remove this later! * naivebayes.c (bow_naivebayes_score_loo): New function, copy of bow_naivebayes_score, with extra code to do leave-one-out testing if argument LOO is non-negative. (bow_naivebayes_score): Call above function with -1 for LOO. (bow_method_naivebayes): Change NORMALIZE_WEIGHTS from bow_barrel_normalize_weights_by_summing() to NULL.
The normalizing function was not taking account of the Laplace smoothing numbers, and was giving incorrect weights. (bow_method_crossentropy): Likewise. * istext.c (bow_fp_is_text): Increase NUM_LINE_LENGTHS to NUM_TEST_CHARS to avoid potential crash. * docnames.c (bow_map_filenames_from_dir): For directory names and filenames, make it use names of soft links, not the directories that the links point to. * barrel.c (bow_barrel_add_document): New function. * bow/libbow.h: Declare new function. * docnames.c (bow_map_filenames_from_dir): Change commented-out code so that, if uncommented, this function will work if you pass it a filename instead of a directory name. Tue May 6 15:30:30 1997 Andrew McCallum * Makefile.local (rainbow-h): Make it depend on libbow.a. * rainbow-h.c: May 5 changes from Andrew Ng. (rainbowh_unarchive): Switch order of unarchiving for vocabulary and hier_barrel. (hier_barrel_new_from_file): Use bow_barrel_new_from_data_file() instead of bow_barrel_new_from_fp(), so we close FILE*'s instead of keeping them open. Otherwise we run out of UNIX's available open file descriptors. * wi2dvf.c (FREE_WHEN_HIDING_WI): New macro. (bow_wi2dvf_hide_wi): Heed it. (bow_wi2dvf_dv): Don't check to make sure that WI is less than bow_num_words(). Check SEEK_START before returning a non-NULL DV, because if SEEK_START is less than -1, the DV should be considered `hidden'. * opts.c (bow_exclude_filename): New global variable. (bow_options): New option "exclude-filename". (parse_bow_opt): Handle it. * docnames.c (bow_map_filenames_from_dir): Make sure BOW_EXCLUDE_FILENAME is non-NULL before passing it to strcmp(). * bow/libbow.h (bow_exclude_filename): Declare new global variable. * barrel.c (bow_barrel_set_cdoc_priors_to_class_uniform): Use bow_malloc() instead of alloca(), so that bow_realloc() will work. free() it at the end. (bow_barrel_new_from_data_file): New function.
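Note on the May 6 barrel.c entry: memory obtained from alloca() lives on the stack and cannot be handed to realloc(), which is why a buffer that may need to grow has to come from the heap. The sketch below is only an illustration using plain malloc()/realloc(); count_docs_per_class and its arguments are hypothetical names, not the code in bow_barrel_set_cdoc_priors_to_class_uniform().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not libbow's API: count documents per class index,
   growing the counts array on demand.  Returns a malloc()ed array holding
   *num_classes counts, which the caller must free(), or NULL on failure. */
int *
count_docs_per_class (const int *class_of_doc, int num_docs, int *num_classes)
{
  int capacity = 4;
  int used = 0;
  int di;
  int *counts;

  assert (num_classes != NULL);
  counts = malloc (capacity * sizeof (int));  /* heap, so realloc() is valid */
  if (counts == NULL)
    return NULL;
  for (di = 0; di < num_docs; di++)
    {
      int ci = class_of_doc[di];
      if (ci < 0)
        continue;                 /* ignore invalid class indices */
      if (ci >= capacity)
        {
          int *bigger;
          while (ci >= capacity)
            capacity *= 2;
          bigger = realloc (counts, capacity * sizeof (int));
          if (bigger == NULL)
            {
              free (counts);
              return NULL;
            }
          counts = bigger;
        }
      while (used <= ci)
        counts[used++] = 0;       /* zero new entries before counting */
      counts[ci]++;
    }
  *num_classes = used;
  return counts;
}
```

Zeroing each new entry before it is incremented matters as well; the Apr 3 barrel.c entry records a fix for exactly that kind of omission.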
Mon May 5 21:08:34 1997 Andrew McCallum * rainbow-h.c: Changes by Andrew Ng, before Andrew McCallum's changes to close barrel FP's. Fri May 2 09:53:12 1997 Andrew McCallum * rainbow-h.c: Additions by Andrew Ng to implement cousin scheme. Wed Apr 30 10:48:30 1997 Andrew McCallum * Makefile.in: Include Makefile.local, avoiding error if it isn't present. * barrel.c (bow_barrel_keep_top_words_by_infogain): Unhide and hide the DVF's instead of removing them, so that we can call this function multiple times with increasing NUM_WORDS_TO_KEEP. * wi2dvf.c (bow_wi2dvf_hide_wi): New function. (bow_wi2dvf_unhide_all_wi): New function. (bow_wi2dvf_dv): Handle new negative values of SEEK_START set by BOW_WI2DVF_HIDE_WI(). * bow/libbow.h: Declare new functions. (bow_doc_type): Add ignored_model, for rainbow-h.c. Thu Apr 24 09:03:10 1997 Andrew McCallum * vpc.c (bow_barrel_set_vpc_priors_by_counting): Fix crash that occurs if limited vocabulary causes all files in a class to be empty. * stoplist.c (bow_stoplist_add_word): New function. * rainbow-stats.pl (confusion): Print percentage correct for each category. * istext.c (bow_fp_is_text): Also return 0 for files that have more than 30% of their lines of the same length. This way we avoid files containing uuencoded blocks. * bow/libbow.h: Declare new function. Tue Apr 22 11:19:03 1997 Andrew McCallum * deflexer.c (bow_default_lexer): Add cast to initialization to avoid warning. Add a uniform, global way of keeping track of binary file format versions. * io.c (bow_file_format_version): New global variable. (bow_write_format_version_to_file): New function. (bow_read_format_version_from_file): New function. * bow/libbow.h (bow_file_format_version): Declare new global variable. (BOW_DEFAULT_FILE_FORMAT_VERSION): New macro. (bow_write_format_version_to_file): New function declaration. (bow_read_format_version_from_file): New function declaration. * rainbow.c (FORMAT_VERSION_FILENAME): New macro.
(rainbow_archive): Write format version to disk. (rainbow_unarchive): Read it from disk if the file exists, otherwise set it to 3, which is the format version number of data before BOW_FILE_FORMAT_VERSION was added to the library. * rainbow.c (rainbow_options): New option "print-word-counts", alias for "print-counts-for-words". Hide the latter option from the --help text. * rainbow-stats.pl (confusion): Print confusion matrix in a more readable format. Add new command-line option to rainbow for using only 0 or 1 word counts. * opts.c (bow_binary_word_counts): New global variable. (bow_options): New option "binary-word-counts". (parse_bow_opt): Handle it. * bow/libbow.h: Declare new global variable. * dv.c (bow_dv_add_di_count_weight): When BOW_BINARY_WORD_COUNTS is true, insist on keeping DV's entry count below 2, i.e. 0 or 1. Fri Apr 18 16:09:06 1997 Andrew McCallum * configure.in: Add -Wno-implicit to default CFLAGS. * rainbow.c (rainbow_lisp_query): Return if QUERY_WV is empty. (Previously would have crashed.) * tfidf.c (TFIDF_METHOD): Fix typo that defined _register_method_tfidf_.. functions without the last underscore. (Reported by Kamal Nigam.) * split.c (bow_test_split): When selecting documents for the test set, if we randomly pick a document that is already in the test set, don't just scan sequentially for the next non-test document; pick a new random number instead. This avoids long contiguous stretches of test documents. * naivebayes.c (bow_naivebayes_score): Move the handling of SCORE_WITH_LOG_PROBABILITIES. * barrel.c (bow_barrel_set_cdoc_priors_to_class_uniform): Assert that CDOC->PRIOR must be greater or equal, not just greater. Thu Apr 10 14:54:08 1997 Andrew McCallum * rainbow-h.c: Fix the `compile-command'. (PRINT_TREE_SCORES): New macro. (hier_set_method): New function. (main): Call it if BOW_ARGP_METHOD is non-NULL. * deflexer.c (bow_default_lexer): Initialize it to -1, so that deflexer.o will get linked in under SunOS. Ug. See comment.
* bow/libbow.h (bow_methods): Declare extern! Wed Apr 9 11:14:13 1997 Andrew McCallum * lex-html.c (bow_lexer_html_get_raw_word): Return last word in document, even if it is not followed by a non-word character! * lex-simple.c (bow_lexer_simple_get_raw_word): Likewise. * rainbow.c (rainbow_lisp_setup): Call all __attribute__((constructor)) functions here since this will be dynamically loaded and the constructor functions won't be called then. * opts.c (parse_bow_opt): Remove call to _bow_default_lexer_init(); moved to rainbow.c. Fix a bug whereby --skip-html was a no-op. * deflexer.c (bow_default_lexer_simple, bow_default_lexer_indirect, bow_default_lexer_gram, bow_default_lexer_html, bow_default_lexer_email): Change global variable from struct's to pointers to structs. (_bow_default_lexer_simple, _bow_default_lexer_gram, _bow_default_lexer_html, _bow_default_lexer_email): New static variables. (_bow_default_lexer_init): Set BOW_DEFAULT_LEXER_INDIRECT to point inside of BOW_DEFAULT_LEXER_GRAM, which is the BOW_DEFAULT_LEXER. * opts.c: Now use all default lexers as pointers to struct's instead of struct's. * bow/libbow.h (bow_default_lexer_simple, bow_default_lexer_indirect, bow_default_lexer_gram, bow_default_lexer_html, bow_default_lexer_email): Change global variable from struct's to pointers to structs. * vpc.c (bow_barrel_new_vpc_merge_then_weight): Assert the method name. * Makefile.in (dist-cmu, bow-$(BOW_VERSION).tar.gz): New targets. Tue Apr 8 08:00:00 1997 Andrew McCallum * Version (BOW_MINOR_VERSION): Version 0.7. * bow/libbow.h (BOW_MINOR_VERSION): Likewise. * rainbow.c (RAINBOW_MINOR_VERSION): Version 0.2. * arrow.c (ARROW_MINOR_VERSION): Version 0.2. * NEWS: Update for new version of library and rainbow. * readme.texi: Likewise. * Makefile.in (DIST_FILES): Add NEWS. * Makefile.in (dist): Fix invocation of `tr' for cvs rtag. * split.c (bow_test_next_wv): Initialize CURRENT_DI to avoid warning. * split.c (bow_test_split): Initialize DOC to avoid warning.
* int4word.c (bow_words_keep_top_by_infogain): Initialize MAX_IG_WI to avoid warning. * dv.c (bow_dv_add_di_count_weight): Only give "overflowed short" message at BOW_VERBOSE level, not BOW_PROGRESS level. * crossbow.c (main): Initialize NORMALIZER to zero. * Makefile.in (dist): Create ./bow directory. Fix invocation of argp. (snapshot): Likewise. * configure.in: Add -O to the default CFLAGS. * rainbow.c (rainbow_options): Improve some option help text. (rainbow_parse_opt) [INFOGAIN_PAIR_VECTOR_KEY]: Handle it. * opts.c (bow_options): Improve some option help text. * Makefile.in (version.texi): Define BOWVERSION instead of BOW_VERSION, so makeinfo can get the value. (%.dvi, %.info): Fix typo. * libbow.texi: Fix typos and begin preliminary documentation. * rainbow.c (rainbow_options): New option "repeat"/'r'. (rainbow_parse_opt): Handle it. (rainbow_arg_state): New member REPEAT_QUERY. (rainbow_query): Attend to REPEAT_QUERY. * naivebayes.c (bow_naivebayes_set_weights): Fix assertion so it works for both naivebayes and crossentropy. Mon Apr 7 11:00:06 1997 Andrew McCallum * sarray.c (bow_sarray_entry_at_keystr): If there is no index for that KEYSTR, print an error message. This way if user mistypes a method name to rainbow's -m option, they get a message that makes some sense. * opts.c (_help_filter): New function to add the names of the available methods to the help text. (bow_argp): Put it in. Use strings to identify methods instead of integers. Separate method declarations into separate .h files. * bow/tfidf.h, bow/naivebayes.h, bow/prind.h: New files. * Makefile.in (LIBBOW_H_FILES): Add files bow/naivebayes.h, bow/tfidf.h, bow/prind.h. * naivebayes.c (bow_method_naivebayes, bow_method_crossentropy): Use string method identifier instead of integer. * prind.c (bow_method_prind): Likewise. * tfidf.c (TFIDF_METHOD): Likewise. * rainbow.c (rainbow_parse_opt) [G]: Step through methods according to new BOW_METHODS bow_sarray, instead of old static array.
* methods.c (bow_methods): Static array removed. (bow_methods): Renamed from _bow_str4method, and made non-static. * barrel.c (bow_method_id, _old_bow_methods): Put copies of what used to be in libbow.h here, so we can unarchive old-format barrel's. (BOW_DEFAULT_BARREL_VERSION): Changed from 2 to 3. (bow_barrel_new_from_data_fp): If VERSION_TAG is less than 3, read the method id integer and use _OLD_BOW_METHOD, otherwise, read a string and use new BOW_METHOD_AT_NAME(). (bow_barrel_write): Write the method as a string instead of as an integer. * Makefile.in (ALL_CPPFLAGS): -I$(srcdir) instead of -I$(srcdir)/bow. * All files: Include instead of "libbow.h". * bow/libbow.h: Include , , . (bow_method_register_with_name, bow_method_at_name): Declare functions. (bow_method_id): Typedef removed. (bow_str_to_method_id): Macro removed. (bow_methods): Global variable removed. (bow_method_tfidf_words, bow_method_tfidf_log_words, bow_method_tfidf_log_occur, bow_params_tfidf): Removed. (bow_method_prind, bow_params_prind): Removed. (bow_method_naivebayes, bow_params_naivebayes): Removed. * methods.c (bow_method_at_name): Comment function. (bow_method_register_with_name): Likewise. * opts.c (parse_bow_opt) [m]: Use bow_method_at_name(). * naivebayes.c: Use bow_method_register_with_name(). Add new method "crossentropy". (bow_naivebayes_score): Pay attention to SCORE_WITH_LOG_PROBABILITIES when setting class priors. When it is true, use inverse of cross-entropy instead of negative! * prind.c: Use bow_method_register_with_name(). * tfidf.c: Use bow_method_register_with_name(). * rainbow.c (main): Strip any trailing `/'s from classnames, so FILENAME_TO_CLASSNAME() will find the classnames. (Reported by Jason Rennie .) * rainbow-h.c (PRINT_COUNTS_FOR_WORD_KEY): New macro. (rainbowh_options): New option "print-counts-for-words". (rainbowh_parse_opt): Handle it. (struct rainbowh_arg_state): New member PRINTING_WORD. (hier_barrel_print_word_counts): New function. 
(main): Handle new option. Do the right thing for `-O' if BOW_PRUNE_VOCAB_BY_OCCUR_COUNT_N. * info_gain.c (LEAVE_OUT_LAST_CLASS): Macro defined once at top. Changed from 0 to 1. * install.texi: Explain the results of --prefix. Remove old references to Objective C installation. Thu Apr 3 12:50:23 1997 Andrew McCallum * rainbow.c (rainbow_test_files): Use macros for setting QUERY_WV weights, so we can handle case in which the wv normalizer is NULL! (main): Replace code for implementing word-count-printing with call to new function. * barrel.c (bow_barrel_set_cdoc_priors_to_class_uniform): Initialize ci2dc entries to zero! (bow_barrel_print_word_count): New function. * opts.c (bow_options): Add new option "naivebayes-score-with-log-probs". (parse_bow_opt): Handle it. * naivebayes.c (bow_naivebayes_score): Begin adding code to support SCORE_WITH_LOG_PROBABILITIES parameter; not yet finished. (bow_naivebayes_params): Add initializer for SCORE_WITH_LOG_PROBABILITIES, initialize it to BOW_NO. * bow/libbow.h: Declare new function. (bow_params_naivebayes): New entry SCORE_WITH_LOG_PROBABILITIES. Wed Apr 2 10:07:30 1997 Andrew McCallum * configure.in: Add a check to see if __attribute__((constructor)) works. If it does not, define CONSTRUCTOR_FAILS. * rainbow.c (rainbow_lisp_setup): Fix typo. * Makefile.in ($(PERL_RUNNABLE_FILES)): Use % in pattern and $< in rule so that we get the .pl file from the $(srcdir). * rainbow-h.c (rainbowh_options): New option "print-infogain-vector", 'I'. (struct rainbowh_arg_state): Add state for it. (rainbowh_parse_opt): Handle it. (hier_barrel_write_to_file): Close the FP after writing a barrel. (hier_barrel_set_vpc_with_weights): Construct and pass a CLASSNAMES array. (hier_barrel_set_cdoc_priors_to_class_uniform): New function. (_hier_barrel_set_node_scores): Print a little header/separator if BOW_PRINT_WORD_SCORES. (hier_barrel_test): Initialize the QUERY_WV to NULL, so BOW_TEST_NEXT_WV doesn't try to free unallocated memory.
(hier_barrel_print_infogain): New function. (rainbowh_archive): New function. (rainbowh_unarchive): New function. (main): Use above two functions. Deal with printing infogain. * rainbow.c: Re-written for using libargp. This should make it work with the WebKB lisp crawler again. * prind.c (bow_prind_score): Make sure CDOC->FILENAME is non-NULL before trying to print it when BOW_PRINT_WORD_SCORES is true. * opts.c (parse_bow_opt) [ARGP_KEY_INIT]: Call _bow_default_lexer_init(). * deflexer.c (_bow_default_lexer_init): Don't make it static. Use static local variable to make sure we don't run through it twice. This is because we will call it explicitly in opts.c:parse_bow_opt(), because __attribute__ ((constructor)) doesn't seem to work on SunOS. * Makefile.in (PERL_FILES): Added rainbow-ac.pl and rainbow-pr.pl. * (rainbow-ac.pl, rainbow-pr.pl): New files from Dayne Freitag. Tue Apr 1 10:11:03 1997 Andrew McCallum * rainbow-h.c (rainbowh_parse_opt): Implement option 'M' for use_maximum_likelihood_path. (hier_default_method): Renamed from METHOD; all uses changed. (hier_barrel): New member NUM_NON_REST_CDOCS, to keep track of DOC_BARREL->CDOCS->LENGTH *before* the `rest' documents start getting added, so that we can implement HIER_PARENT_DI_TO_CHILD_INDEX_AND_DI properly. (hier_barrel_new): Initialize it to -1. (hier_barrel_add_child): Set it. (hier_barrel_new_from_text_dir_leaf): Set it. (hier_barrel_write_to_file): Write it. (hier_barrel_new_from_file): Read it. (hier_parent_di_to_child_index_and_di): Use it. (hier_barrel_print): Print it instead of DOC_BARREL->CDOCS->LENGTH. (hier_barrel_add_stats): New function split out from HIER_BARREL_ADD_CHILD. (hier_barrel_add_child): Use it. (hier_barrel_add_rest): New function. (hier_barrel_new_from_text_dir): Call it to add `rest' documents. (hier_barrel_test): Allocate space for 3 times as many SCORES, to make room for the `rest' classes. (main): Set HIER_DEFAULT_METHOD from BOW_ARGP_METHOD, if non-NULL.
* scale.c (bow_barrel_scale_weights_by_given_infogain): Only verbosify every 100 words. (bow_barrel_scale_weights_by_given_foilgain): Likewise. * vpc.c (bow_barrel_set_vpc_priors_by_counting): Fix indentation. * rainbow-h.c: Converted to do command-line argument processing with libargp. * opts.c (bow_options): Remove "version" 'V' option. libargp can handle that automatically. (_print_version): New function to print both program version and library version. (argp_program_version_hook): Set it to _PRINT_VERSION(). * rainbow.c (rainbow_print_usage): Function removed. Libargp does that now. Mon Mar 31 11:07:30 1997 Andrew McCallum * barrel.c (bow_barrel_set_cdoc_priors_to_class_uniform): Use ALLOCA() instead of BOW_MALLOC() to avoid memory leak. * Makefile.in (configure, config.status): Sprinkle with $(srcdir). * configure.in: Move the setting of CFLAGS above AC_PROG_CC, so that it will have an effect. * install.texi: Mention how to set CPPFLAGS in the ./configure line. * vpc.c (bow_barrel_set_vpc_priors_by_counting): Properly set the CDOC->PRIOR's. * rainbow.c (INFOGAIN_PAIR_VECTOR_KEY): New macro. (rainbow_options): New option "infogain-pair-vector". (rainbow_parse_opt): Handle it. (main): Likewise. When RAINBOW_WORD_COUNT_PRINTING, also print the total number of words in each class. * prind.c (bow_prind_set_weights): Get MAX_WI from MIN of WI2DVF->SIZE and BOW_NUM_WORDS(), not just BOW_NUM_WORDS(). * opts.c (bow_uniform_class_priors): New global variable. (bow_options): New option "uniform-class-priors". (parse_bow_opt): Handle it. * naivebayes.c (bow_naivebayes_set_weights): Get MAX_WI from MIN of WI2DVF->SIZE and BOW_NUM_WORDS(), not just BOW_NUM_WORDS(). (bow_naivebayes_score): Pay attention to BOW_UNIFORM_CLASS_PRIORS. Don't sum in score of words that don't have a DV entry! Previously we were allowing words that `aren't in the vocabulary' of the BARREL to contribute! This was wrong.
They were contributing according to the Laplace Estimators, and classes with larger numbers of words were getting penalized. * info_gain.c (bow_infogain_per_wi_new): Sum floating point CDOC->PRIOR's instead of increment integer count of documents, so that infogain can be calculated from documents with different `weights'. (bow_infogain_per_wi_new_using_pairs): New function. For now it prints its results instead of returning them. * barrel.c (bow_barrel_set_cdoc_priors_to_class_uniform): New function. * bow/libbow.h: Declare new functions. Mon Mar 31 11:56:48 1997 Andrew McCallum * Makefile.in (CFLAGS, CPPFLAGS): Get values from configure. * configure.in: Do AC_SUBST() for CPPFLAGS and CFLAGS. Fri Mar 28 10:28:26 1997 Andrew McCallum * rainbow-h.c: Fix spelling: "heir" -> "hier". How embarrassing! * dv.c (bow_dv_new_from_data_fp): Fix typo in feof() assertion. (Reported by Doreen Cheng .) * rainbow.c (PRINT_COUNTS_FOR_WORD_KEY): New macro. (rainbow_options): New option "print-counts-for-word". (rainbow_parse_opt): Handle it. (main): Implement it. * bow/libbow.h: (bow_wi2dvf): Add new element to structure: `num_words'. (bow_barrel): Put `is_vpc' at end of structure instead of the beginning. * wi2dvf.c (bow_wi2dvf_new): Initialize NUM_WORDS. (bow_wi2dvf_add_di_wv): Increment it. (bow_wi2dvf_add_wi_di_count_weight): Likewise. (bow_wi2dvf_new_from_data_fp): Likewise. (bow_wi2dvf_remove_wi): Decrement it. (bow_wi2dvf_print_stats): Print it. * prind.c (bow_prind_set_weights): Use BARREL->WI2DVF->SIZE and BARREL->WI2DVF->NUM_WORDS instead of BOW_NUM_WORDS(). In particular, this will allow us to set the Laplace estimators using the correct number of words in the barrel, not the arbitrary libbow-wide vocabulary size. Properly use CDOC->WORD_COUNT instead of overloading CDOC->NORMALIZER. (bow_prind_score): Likewise use BARREL->WI2DVF->SIZE and BARREL->WI2DVF->NUM_WORDS instead of BOW_NUM_WORDS(). (bow_print_word_scores): Removed to opts.c. 
* opts.c (bow_print_word_scores): Global variable moved here from prind.c. (bow_options): New option "print-word-scores". (parse_bow_opt): Handle it. * naivebayes.c (bow_naivebayes_set_weights): Use BARREL->WI2DVF->SIZE and BARREL->WI2DVF->NUM_WORDS instead of BOW_NUM_WORDS(). In particular, this will allow us to set the Laplace estimators using the correct number of words in the barrel, not the arbitrary libbow-wide vocabulary size. (bow_naivebayes_score): Likewise, and add code to print the score contributions of each word when BOW_PRINT_WORD_SCORES is non-NULL. (SCORE_WITH_LOG_PROBABILITIES): New macro. * barrel.c (bow_barrel_printf): Comment out the code that would skip over documents that are not of type `model'. Thu Mar 27 11:29:34 1997 Andrew McCallum * rainbow-stats.pl: Make output labels more descriptive. Say `average percentage accuracy'. * split.c (bow_test_split): Use the micro-seconds field from gettimeofday() instead of time() to set the random number generator seed. Otherwise, if we re-call this function too quickly we'll get exactly the same seed! ...because time() returns a number of seconds. * demos/script: New shell script file that will demo rainbow, with running commentary. * demos/data: New directory containing 20 articles 2 newsgroups. This is for use with demos/script. * install.texi: Remove mention of `checks' and `examples' directory; they don't exist. (Reported by Doreen Cheng.) Mon Mar 24 12:07:53 1997 Andrew McCallum * Makefile.in (rainbow-lisp.o): Use $(ALL_CPPFLAGS) and $(ALL_CFLAGS) instead of non-ALL versions. * rainbow.c (rainbow_lisp_setup): Rewrite for use with libargp. * methods.c (bow_method_at_name): Fix typo. (bow_method_at_index): Likewise. * opts.c (parse_bow_opt): Use 'g' instead of 'N' for setting gram size. * rainbow.c (rainbow_lisp_query): Free the QUERY_WV before returning! * methods.c (bow_method_register_with_name): New function. (bow_method_at_name): New function. * arrow.c (PRINT_IDF_KEY): New macro.
(arrow_options): Add new option "print-idf". (struct arrow_arg_state): New enum ARROW_PRINTING_IDF. (arrow_index): Prune the vocabulary if BOW_PRUNE_VOCAB_BY_OCCUR_COUNT_N is non-zero. (main): Add code to print idf values. * lex-simple.c (bow_alpha_lexer, bow_alpha_only_lexer, bow_white_lexer): Initialize STEM_FUNC to 0 instead of BOW_STEM_PORTER. * tfidf.c (bow_tfidf_set_weights): Comment out code that sets total_word_count. Do the DF_TRANSFORM on DF, not on IDF! Otherwise we get negative IDF's. * rainbow-h.c (use_maximum_likelihood_path): New global variable. (_heir_barrel_set_node_scores): Use it. (main): Set it when -M passed on command line. (num_top_words): Moved from main-local variable to global. (heir_barrel_test): Reduce vocab by infogain. Fri Mar 21 14:02:39 1997 Andrew McCallum * bow/libbow.h (bow_lexer_simple): Add entry TOSS_WORDS_LONGER_THAN. (bow_wv_set_weights_to_count_times_idf): Declare new function. * wv.c (bow_wv_set_weights_to_count_times_idf): New function. * tfidf.c (bow_tfidf_set_weights): Comment out code saying that TFIDF is broken. Rewrite the way IDF is calculated. (bow_tfidf_score): Set and normalize the QUERY_WV weights here (even though it is redundant) so that we can properly use the IDF from the BARREL when normalizing weights. Normalize the QUERY_WV weight when incrementing CURRENT_SCORE. * prind.c (bow_prind_set_weights): Skip a document if it is not of type model, both when setting NORMALIZER and TOTAL_TERM_COUNT, and when setting weights. (bow_prind_score): Skip a document if it is not of type model. * lex-simple.c (bow_lexer_simple_postprocess_word): Add code to toss words longer than SELF->TOSS_WORDS_LONGER_THAN. Set WORDLEN at beginning. It appeared that it was getting used uninitialized before! (bow_alpha_lexer, bow_alpha_only_lexer, bow_white_lexer): Add value for new field TOSS_WORDS_LONGER_THAN. * opts.c (APPEND_STOPLIST_FILE_KEY): New macro.
(bow_options): Added "append-stoplist-file". (parse_bow_opt): Handle new option. * int4str.c (_str2id): Return the absolute value of the old return value. Sometimes with really long strings, the return value was going negative. (_str_hash_lookup): Assert that ID is non-negative. Thu Mar 20 11:47:49 1997 Andrew McCallum These changes by Karl Kleinpaste * int4word.c (bow_words_reread_from_file): Use fopen() instead of bow_fopen(), so we are sure not to call abort(). * wv.c (bow_wv_sprintf): Fix function to account for length troubles properly. (bow_wv_sprintf_words): New function, prints the words themselves, rather than the word indices. * bow/libbow.h: Declare new function. * naivebayes.c (bow_naivebayes_set_weights): Add commented-out code that forces all counts to either 0 or 1. This was used on some experiments with Shumeet. * lex-html.c (bow_lexer_html_get_raw_word): Add a ! to the FALSE_TO_END condition test, so we don't end the tokenization too early. Tue Mar 18 14:47:35 1997 Andrew McCallum * rainbow.c (rainbow_parse_opt) [ARGP_KEY_END]: Print a useful error when only one classname is given. (main): Check for rainbow_infogain_printing properly. * opts.c (parse_bow_opt) [ARGP_KEY_END]: Check for the existence of BOW_DATA_DIRNAME in a way that works even when the directory is owned by someone else. * bow/libbow.h (bow_fread_string): Assert that the string length is non-negative. * barrel.c (_bow_barrel_version): New variable. (BOW_DEFAULT_BARREL_VERSION): New macro. (bow_barrel_new_from_data_fp): Read the version number instead of a null_tag. (bow_barrel_write): Likewise, for writing. * arrow.c (main): Remove redundant code that is now in opts.c. Mon Mar 17 12:09:32 1997 Andrew McCallum * Makefile.in (%.o:%.c): Fix the order on this pattern rule. ($(DEMO_EXECUTABLES):%:%.o): Put $(DEMO_EXECUTABLES) at the beginning of this pattern, so it matches only those files. * arrow.c: Don't include getopt.h; we're using argp.h instead. (arrow_index): Fix typo.
* configure.in: Don't look for getopt.h anymore. We don't need it now that we are using libargp. * configure.in: AC_INIT looking for int4str.c instead of libbow.h. * Makefile.in (%): Use this pattern to make DEMO_EXECUTABLES instead of listing them all. This avoids making all the .o's for one of the DEMO_EXECUTABLES. * rainbow.c: Converted to use argp command-line argument processing. * opts.c (bow_argp_method): Renamed from bow_default_method. (parse_bow_opt) [ARGP_KEY_INIT]: Add words to stoplist. * deflexer.c (_bow_default_lexer_init): Initialize bow_default_lexer to BOW_DEFAULT_LEXER_GRAM, not BOW_LEXER_GRAM! * bow/libbow.h (bow_argp_method): Renamed from bow_default_method. * arrow.c (arrow_parse_opt) [q]: Set query.filename. (arrow_index): BOW_DEFAULT_METHOD renamed to BOW_ARGP_METHOD. * arrow.c (arrow_index): Set the method according to BOW_DEFAULT_METHOD. * opts.c: Fleshed out into first working version. * error.c: Comment fix. Include libbow.h and stdio.h. * deflexer.c (_bow_default_lexer_init): New constructor function. (bow_default_lexer_simple, bow_default_lexer_indirect, bow_default_lexer_gram, bow_default_lexer_html, bow_default_lexer_email): New variables, default instantiations of lexers. * bow/libbow.h: Add argp declarations. (bow_argp_children): New variable. (bow_prune_vocab_by_infogain_n): New variable. (bow_prune_vocab_by_occur_count_n): New variable. (bow_default_method): New variable. (bow_data_dirname): New variable. * arrow.c: Convert to using argp for command-line processing. * Makefile.in: Change all instances of `libbow.h' to `bow/libbow'. (includedir): Add `/bow' to end. (LIBBOW_C_FILES): Add opts.c. (ALL_CPPFLAGS): add -I$(srcdir)/bow and -I$(srcdir)/argp. (rainbow-lisp.o): Use $< instead of rainbow.c, so VPATH will find it when compiling in a different directory than the source. * bow/libbow.h (STRINGIFY): New macro. 
(bow_default_lexer_simple, bow_default_lexer_indirect, bow_default_lexer_gram, bow_default_lexer_html, bow_default_lexer_email): Declare default instantiations of lexers. Fri Mar 14 11:01:14 1997 Andrew McCallum * Makefile.in (LIBBOW_C_FILES): Renamed defparser.c to deflexer.c. * deflexer.c: Renamed from defparser.c. Add the `argp' subdirectory, and incorporate it into the Makefile. * HACKING: Add argp autoconf instruction. * configure.in: Call AC_CONFIG_SUBDIRS to configure argp also. * Makefile.in (ALL_LIBS): Move it closer to $(DEMO_EXECUTABLES) target. $(DEMO_EXECUTABLES): Make this target depend on argp/libargp.a. (install): Call make install in argp directory also. (dist, snapshot): Call make in argp directory to include its files too. Wed Mar 12 20:00:27 1997 Andrew McCallum * Makefile.in (CPPFLAGS): Don't include $(DEFS) here, it's now in ALL_CPPFLAGS. * Makefile.in (ALL_CPPFLAGS): New variable. (ALL_CFLAGS): New variable. (.c.o): New pattern rule that uses above new variables. Now Kamal can safely type `make CPPFLAGS=-DNDEBUG'. * rainbow-h.c (_heir_barrel_set_node_scores): Don't threshold the scores to 0/1. (strdup): New function. Implement this local version to help with debugging. Consider removing it later. * libbow.h (bow_params_prind): Remove variable SCALE_BY_FOILGAIN. It isn't needed since we have a function pointer for it in BOW_METHOD. * prind.c (bow_prind_params): Remove BOW_NO for scaling. * rainbow.c (rainbow_lisp_setup): Remove setting of BOW_PRIND_SCALE_BY_INFOGAIN; it now defaults to on. (rainbow_print_usage): Change the sense of -G. It now turns off foilgain scaling, instead of turning it on. (Actually, it was the default before this anyway.) (main): Given -G, zero-out the SCALE_WEIGHTS entries in all the methods. Tue Mar 11 11:58:03 1997 Andrew McCallum * Version (BOW_MINOR_VERSION): Version 0.6. * libbow.h: Likewise. * Makefile.in (DIST_FILES): Add TODO. Remove p.inc. (p-alpha.o, p-alonly.o, p-white.o): Targets removed.
* rainbow.c (rainbow_query): Use bow_barrel_ macros instead of indexing into the methods structure manually. * crossbow.c: Add copyright info. * readme.texi: Fill out. * libbow-desc.texi: Add description. * install.texi: Add pointer to the README. Say that it requires GCC. * HACKING: Update CVS repository machine name. * tfidf.c (bow_tfidf_set_weights): Insert disclaimer explaining that TFIDF is broken. Tue Mar 11 11:31:39 1997 Rahul Sukthankar * Makefile.in (DEMO_C_FILE): Added crossbow.c. * crossbow.c: New file. Mon Mar 10 18:52:03 1997 Andrew McCallum * int4str.c (_str2id): Keep return value smaller using modulus. This fixes bug Rosie Jones encountered with negative hash values. (_str_hash_lookup): Assert that H is non-negative. * Makefile.in (LIBBOW_C_FILES): Added lex-email.c. Fri Mar 7 10:54:09 1997 Andrew McCallum * int4word.c (bow_words_reread_from_file): Make sure LAST_FILE is non-NULL. Tue, 18 Feb 1997 20:15:42 -0500 Jason Rennie * lex-email.c: New file. Created lexer for e-mail/newsgroup messages. * lex-html.c: Changed code to allow words separated by HTML tags to be tokenized as single words. Big is now tokenized as "Big". Nested brackets are now ignored. This should more closely model the way HTML is interpreted. * rainbow.c: Added rainbow_email_lexer as a bow_lexer_indirect. Added '-M' option to allow user to make use of rainbow_email_lexer. rainbow_email_lexer will remove "Newsgroups:" and "Path:" headers from message. * libbow.h (bow_email_headers_to_remove): Declare new global variable. (bow_email_lexer): Likewise. Tue Mar 4 11:51:53 1997 Andrew McCallum * libbow.h (bow_barrel_scale_weights): Don't call underlying function if it's NULL. (bow_barrel_normalize_weights): Likewise. (bow_wv_set_weights): Likewise. (bow_wv_normalize_weights): Likewise. * vpc.c (bow_barrel_new_vpc_merge_then_weight): Use macros for weight setting. (bow_barrel_new_vpc_weight_then_merge): Likewise. * rainbow-h.c (_heir_barrel_cdoc_write): Write WORD_COUNT.
(_heir_barrel_cdoc_read): Read it. (heir_dir_is_leaf): Check the return status from CHDIR(), and print appropriate error message. (heir_barrel_keep_top_words_by_infogain): Return immediately if num_words_to_keep is 0 or the children count is 0. (heir_barrel_set_vpc_with_weights): Return immediately if the children count is 0. (_heir_barrel_set_node_scores): Add temporary #if'ed code to make score either 1 or 0, so winner takes all. (heir_barrel_score_recurse): New argument DEPTH. All callers changed. (main): Change default NUM_TOP_WORDS from 3000 to 0. Add new command line arguments -m and -N. Changes made with Sean Slattery. * naivebayes.c (bow_naivebayes_set_weights): Store class-wide word count in CDOC->WORD_COUNT instead of overloading CDOC->NORMALIZER. (bow_naivebayes_score): Use CDOC->WORD_COUNT instead of CDOC->NORMALIZER. Use it to fix PR_W_C in case where that word doesn't appear in the class. Instead of (1.0 / MAX_WI) use (1.0 / (MAX_WI + CDOC->WORD_COUNT)). Don't normalize the weight by CDOC->NORMALIZER because it is already set to be normalized correctly, including the words that don't appear in the class. (bow_method_naivebayes): Change the weight normalizing function from BOW_NORMALIZE_WEIGHTS_BY_SUMMING to NULL, because we don't use CDOC->NORMALIZER anymore. * libbow.h (bow_cdoc): Add WORD_COUNT. * barrel.c (_bow_barrel_cdoc_write): Write WORD_COUNT. (_bow_barrel_cdoc_read): Read it. * vpc.c (bow_barrel_new_vpc): Assert MAX_CI is positive, otherwise this means we didn't find any classes. Wed Feb 26 11:08:50 1997 Andrew McCallum * HACKING: Fix sandbox's name. Wed Feb 19 11:27:55 1997 Andrew McCallum * barrel.c (bow_barrel_keep_top_words_by_infogain): Return immediately if NUM_WORDS_TO_KEEP is 0. Tue Feb 18 13:39:34 1997 Andrew McCallum * libbow.h (bow_str_to_method_id): Use a temporary variable, so we can use statements like ARGI++ as an argument. * info_gain.c (bow_infogain_per_wi_new): Change assertion to handle round-off error.
* rainbow-h.c: Include , . (heir_barrel): Add components INDEX_IN_PARENT, NUM_LEAVES, FULL_NAME. (heir_barrel_new): Set them. (heir_dir_is_leaf): Use chdir() so that symlinks are dealt with properly. Free() the results of scandir(). (heir_barrel_new_from_text_dir_leaf): Set FULL_NAME and add assertions. (_heir_barrel_new_from_text_dir_recurse): New parameter PARENT_NAME. Move the chdir() to handle symlinks properly. Don't make a SUBDIRNAME. (heir_barrel_new_from_text_dir): New function. (heir_barrel_write_to_file): Write new heir_barrel components. (heir_barrel_new_from_file): Read them. (heir_barrel_free): Free FULL_NAME. (heir_barrel_keep_top_words_by_infogain): New function. (heir_parent_di_to_child_index_and_di): New function. (heir_di_to_classname): New function. (heir_barrel_test_split): New function. (_heir_barrel_set_node_scores): Use bow_barrel_score() instead of bow_get_best_matches(). (heir_barrel_print_scores_recurse): Return void not int. Print all on same line. (heir_barrel_score_recurse): New function. (heir_barrel_score): New function. (heir_barrel_test): New function. (heir_barrel_print_weight_vectors): Change formatting. (set_vocabulary_from_file): New function (unused). (main): Allow user to set DATADIR (-d) and NUM_TOP_WORDS (-T), test (-t). Compile with -Wall. Mon Feb 17 10:36:32 1997 Andrew McCallum * configure.in: Remove check for , all ANSI compilers should have it. * split.c: Remove SunOS declarations of rand() and srand(). (RAND_MAX): Define macro, if not already defined. These two changes needed to compile on SunOS. * naivebayes.c (bow_naivebayes_set_weights): Uncomment assertion about METHOD->ID. * rainbow.c (rainbow_lisp_setup): Add `-N' to effective arguments. (rainbow_lisp_query): Fix typo in BOW_FOPEN() call. * rainbow.c (rainbow_query): Check for QUERY_WV being NULL, and output more useful messages in that case. * naivebayes.c (bow_naivebayes_score): Rearrange the code for stepping through a DV so we always get the CDOC.
This change should have no effect on the outcome. * lex-simple.c (bow_lexer_simple_open_text_fp): Fix test for matching END_PATTERN_PTR. Don't push the DOCUMENT_END_PATTERN back on the input stream after we find it; this is a stylistic choice. * docnames.c (bow_map_filenames_from_dir): Pass relative instead of absolute directory names to recursive calls. Before I was having trouble with symbolic links. This seems to fix it. * int4word.c (bow_words_keep_top_by_infogain): Fix assertions; it's OK to have infogain equal to 0. * prind.c: Comment fixes. * foilgain.c (bow_foilgain_per_wi_ci_new): Use malloc() for POS_PER_WI_CI and NEG_PER_WI_CI, instead of using stack. We were overflowing the stack before. Tue Feb 11 12:15:30 1997 Andrew McCallum * naivebayes.c (bow_naivebayes_score): When word doesn't appear in the class vector, make Pr(w|C) include CDOC->NORMALIZER. (Suggested by Sean Slattery). * naivebayes.c (bow_naivebayes_score): Fix constant in assertion. * configure.in: When perl5 isn't found, PERL will be "", not ":". Deal with it properly. * libbow.h: Don't bother with HAVE_FLOAT, just always include . (bow_get_best_matches): Remove declaration. The function no longer exists. * rainbow.c (rainbow_test): Use macros for accessing method functions. * split.c: Fix author comment. * info_gain.c (bow_infogain_per_wi_new): Use double instead of float, because before we were losing resolution and getting negative IG's. (bow_entropy): Likewise. Mon Feb 10 16:25:04 1997 Andrew McCallum * docnames.c (bow_map_filenames_from_dir): Use perror() when can't open directory. Fri Feb 7 11:00:50 1997 Andrew McCallum * int4word.c (bow_words_read_from_file): Fix typo. * rainbow.c (rainbow_lisp_query): Use bow_barrel_score instead of bow_get_best_matches. These changes by Tony Brusseau , with modifications by . * wv.c (bow_wv_new_from_text_string): New function. (bow_wv_sprintf): New function. * int4word.c (bow_words_set_map): Add new argument indicating if old map should be freed.
All callers changed. (bow_words_reread_from_file): New function. * docnames.c: Include . (bow_map_filenames_from_dir): Add WindowsNT backslashes to first assertion. * libbow.h: Declare new functions. Thu Feb 6 18:33:21 1997 Andrew McCallum * rainbow.c: Updated for below library changes. * arrow.c: Likewise. * libbow.h: Declare many new functions, variables and types, including: (bow_boolean): New type. (bow_wv_set_weights_to_count): New function declaration. (bow_wv_normalize_weights_by_vector_length): Likewise. (bow_wv_normalize_weights_by_summing): Likewise. (bow_str_to_method_id): Macro renamed from bow_str2method. (bow_method_id): New enum, replacing bow_method. (bow_method): Now a struct. (bow_barrel_set_weights, bow_barrel_scale_weights, bow_barrel_normalize_weights, bow_new_vpc_with_weights, bow_barrel_score, bow_wv_set_weights, bow_wv_normalize_weights): New macros. (bow_methods): New global variable declaration. (bow_params_*): New types. (bow_score): Renamed from bow_doc_score. * wv.c (bow_wv_set_weights_to_count): New function. * weight.c: File removed. * vpc.c (bow_barrel_new_vpc_merge_then_weight): New function. (bow_barrel_new_vpc_weight_then_merge): New function. (bow_barrel_set_vpc_priors_by_counting): Renamed from _bow_barrels_set_naivebayes_vpc_priors. * tfidf.c: File contents totally replaced to implement TFIDF. Functions removed from weight.c and other places. (bow_tfidf_set_weights): Function renamed. (bow_tfidf_score): Function renamed from bow_get_best_matches(). (bow_tfidf_params_{words,log_words,log_occur}): New variables. (bow_method_tfidf_{words,log_words,log_occur}): New global variables. * prind.c (bow_prind_uniform_priors): Global variable removed. (bow_prind_scale_by_infogain): Likewise. (bow_prind_normalize_scores): Likewise. (bow_prind_set_weights): Renamed from _bow_barrel_set_prind_weights. (bow_prind_score): Renamed from _bow_score_prind_from_wv, and updated for library changes. (bow_prind_params): New variable. 
(bow_method_prind): New global variable. * score.c: File removed. * naivebayes.c (_bow_barrels_set_naivebayes_vpc_priors): Function removed. Replacement in vpc.c. (bow_naivebayes_set_weights): Minor updates for library changes. (bow_naivebayes_params): New variable. (bow_method_naivebayes): New global variable. * info_gain.c (bow_barrel_scale_weights_by_info_gain): Function removed. Replacement is now in scale.c. (bow_barrel_scale_weights_by_foilgain): Likewise. (bow_foilgain_per_wi_ci_new): Likewise. Replacement now in foilgain.c. (bow_foilgain_free): Likewise. * barrel.c (bow_barrel_new): Make the default method naivebayes, instead of tfidf_log_occur. (bow_barrel_new_from_data_fp): Get the METHOD pointer from BOW_METHODS. (bow_barrel_write): Write the ID. * Makefile.in (CPPFLAGS): Add $(DEFS). (ALL_INCLUDE_FLAGS, ALL_CPPFLAGS, ALL_CFLAGS, ALL_LDFLAGS): Variables removed. (LIBBOW_C_FILES): Added foilgain.c, methods.c, normalize.c, scale.c, tfidf.c. Removed score.c, weight.c. (DEMO_EXECUTABLES): Don't use ALL_LDFLAGS. Tue Feb 4 14:21:08 1997 Andrew McCallum * split.c (bow_test_split): Properly deal with the fact that rand() returns an int between 0 and RAND_MAX, whereas the previously-used drand48() returned a double between 0 and 1. Mon Feb 3 12:44:18 1997 Andrew McCallum Following changes made by Tony Brusseau for WindowsNT compatibility. * configure.in: Check for . * libbow.h: Include float.h if we have it; otherwise include values.h and redefine its macros. (htonl, htons, ntohl, ntohs): Temporarily define as identity for WindowsNT. (bow_fwrite_string): Cast sizeof() to int. * split.c: Use rand() and srand() instead of drand48() and srand48(). * rainbow.c (rainbow_test_files): Make DIRLEN unsigned, to avoid warning under WindowsNT. * primes.c: Use unsigned int's instead of int's in several places, to avoid warnings under WindowsNT. * dv.c: Don't include . Use -style MAX'es. * naivebayes.c: Use -style MAX'es. * int4word.c: Likewise.
* configure.in: Look for the wsock32 library. * prind.c (_bow_score_prind_from_wv): Don't die if SCORES_SUM is zero; just leave zero scores on all classes. * rainbow.c (main): Add `s' to getopt call. (rainbow_index): Add newline to end of "No text files" message. Fri Jan 31 11:28:54 1997 Andrew McCallum * Makefile.in (rainbow-lisp.o): New target. * rainbow.c (rainbow_lisp_setup): New function. (From Kamal.) Surround this and rainbow_lisp_query by #if RAINBOW_LISP. (main): Surround by #if !RAINBOW_LISP. * HACKING: Change the cvsroot in directions for networked `pserver' use. * libbow.h: If under WinNT, include , otherwise include . * bitvec.c (BITSPERBYTE): Surround it with an #ifndef. * prind.c: Don't include . * naivebayes.c: Likewise. * int4word.c: Likewise. * dv.c: Likewise. * bitvec.c: Likewise. (BITSPERBYTE): New macro. Wed Jan 29 10:10:11 1997 Andrew McCallum * rainbow.c (rainbow_lisp_query): New function. * Makefile.in (maintainer-clean): Add config.* files. * configure.in: Add quotes around $PERL so test will still work if $PERL is empty. * rainbow-h.c (_heir_barrel_set_node_scores): Add some temporary progress printing. (heir_barrel_print_scores_recurse): Renamed. (heir_barrel_print_scores): New function. (heir_barrel_print_foilgain): New function. (heir_barrel_print_weight_vectors): New function. (main): Add new options for calling new functions. * prind.c (_bow_score_prind_from_wv): Add three checks for NaN. * int4word.c (bow_words_write_to_file): New function. (bow_words_read_from_file): New function. * libbow.h: Declare new word functions. * rainbow-h.c (method): New global variable. (heir_barrel): Added component SCORE. (heir_barrel_new_from_text_dir_leaf): Set doc_barrel method. (heir_barrel_new_from_text_dir): Likewise. (main): Take -i and -q arguments. * libbow.h (bow_fwrite_string): Change type of LEN from int to short. * vpc.c (bow_barrel_new_vpc): Don't abort if CLASSNAMES is NULL. 
Tue Jan 28 15:49:46 1997 Andrew McCallum * rainbow-h.c: New file. * barrel.c (bow_barrel_write): Handle case in which BARREL is NULL. (bow_barrel_new_from_data_fp): Likewise. (xxx Although there is some strangeness with FGETC returning -1, which I am currently ignoring. I should look at this again...) * defparser.c (bow_default_lexer): Temporary fix to confusion about constant initializers and pointers. * Makefile.in: Fix copyright. * barrel.c: Fix header comment. (bow_barrel_printf): Print the word as well as the word index. * docnames.c (bow_map_filenames_from_dir): Remove commented-out code for checking whether the file contains text. Thu Jan 23 13:47:01 1997 Andrew McCallum * barrel.c (bow_barrel_printf): Don't print with paren's, so it will be easier to process with AWK. * rainbow.c (main): Add new option -B for printing barrel word vectors in ASCII. (rainbow_print_usage): Document it. * barrel.c (bow_barrel_printf): New function. * libbow.h: Declare new barrel function. Sat Jan 18 08:14:59 1997 Andrew McCallum * rainbow.c (main): Add -N option for turning off normalization of PrInd scores by setting BOW_PRIND_NORMALIZE_SCORES. (rainbow_print_usage): Document it. * vpc.c (FOILGAIN): New macro, defined to be 1. Switching back to doing foilgain by default. * prind.c (bow_prind_normalize_scores): Change to 1. Now normalizing scores by default. * libbow.h: Declare prind normalization global variable. * vpc.c (bow_barrel_new_vpc_with_weights): Condition choice of weight-scaling on FOILGAIN. Now default is to do scaling by information gain (again). * prind.c (bow_prind_normalize_scores): New global variable. Default: 0, don't normalize. Note, this is different than what we were doing before, and the default should be changed back to 1. (_bow_score_prind_from_wv): Use it. * rainbow.c (rainbow_index): Don't exit with error when a class directory is empty, just print a message. (rainbow_usage): Fix description of -G to Foil-gain, not info-gain. 
* prind.c: Improve formatting of score printing. * rainbow.c (main): New command-line argument `-P' sets BOW_PRINT_WORD_SCORES. (rainbow_print_usage): Document it. * prind.c (bow_print_word_scores): Define new global variable. (_bow_score_prind_from_wv): Use it to decide when to print per-word/class score information. * libbow.h (bow_print_word_scores): Declare new global variable. Fri Jan 17 09:39:30 1997 Andrew McCallum * rainbow.c (printing_class): Global variable renamed from weight_vector_printing_class. (rainbow_print_foilgain): New function. (main): Call it, and add -F option for doing so. (rainbow_print_usage): Document it. * vpc.c (bow_barrel_new_vpc_with_weights): Use new foilgain function for PrInd, instead of infogain function. * info_gain.c (bow_barrel_scale_weights_by_info_gain): Set max_wi! Previously it was uninitialized. (bow_foilgain_per_wi_ci_new): New function. (bow_foilgain_free): New function. (bow_barrel_scale_weights_by_foilgain): New function. * libbow.h: Declare new foilgain functions. * docnames.c (bow_map_filenames_from_dir): Add assertion that checks for conditions under which directory-vs-file detection is unreliable. I should figure out why this isn't working as expected. * prind.c (bow_prind_scale_by_infogain): New global variable. * libbow.h (bow_prind_scale_by_infogain): Declare new global variable. * rainbow.c (main): Add -G for setting bow_prind_scale_by_infogain. * vpc.c (bow_barrel_new_vpc_with_weights): Use BOW_PRIND_SCALE_BY_INFOGAIN. * rainbow.c (rainbow_query): Only re-build the RAINBOW_CLASS_BARREL if -m or -T arguments require its change. Thu Jan 16 13:35:31 1997 Andrew McCallum * vpc.c (bow_barrel_new_vpc_with_weights): For PrInd, scale weights by information gain. (Oooh, this may not be kosher; perhaps remove it later.) * rainbow.c (rainbow_print_weight_vector): Multiply the weight by its normalizer.
* info_gain.c (bow_barrel_scale_weights_by_info_gain): Leave more space for verbosifying progress, because the numbers are big. * prind.c (_bow_barrel_set_prind_weights): Remove the information gain scaling, i.e. undo previous change. * info_gain.c (bow_barrel_scale_weights_by_info_gain): Change the arguments so that the information gain array is passed in, not calculated inside the function. * libbow.h: Change arguments to weight info gain scaling function. * naivebayes.c (_bow_score_naivebayes_from_wv): Scale DV WEIGHT by CDOC NORMALIZER! * prind.c (_bow_score_prind_from_wv): Scale DV WEIGHT by CDOC NORMALIZER! * rainbow.c (rainbow_query): If score is less than 1e-35, then just print zero. Do this for the sake of CommonLisp, which can't read numbers smaller than 1e-35. (rainbow_test): Likewise. (rainbow_test_files): Likewise. Wed Jan 15 10:52:13 1997 Andrew McCallum * scan.c (bow_scan_fp_for_string): Change the for() to a while() to clean up the handling of STRING_PTR incrementation. * rainbow.c (method): Fix its handling from last change. (rainbow_print_weight_vector): New function. (main): Call it. (rainbow_print_usage): Add -W. * prind.c (_bow_barrel_set_prind_weights): Added comment. * lex-simple.c (bow_lexer_simple_get_raw_word): Back up DOCUMENT_POSITION to point at terminating character. Add more comments. * int4str.c (_str_hash_lookup): Add assertion. (_str_hash_add): Add assertions. (bow_str2int): Increment MAP->STR_ARRAY_LENGTH++ first, then return -1 the value, instead of returning the value++. This makes the intermediate calls more clear and safe. * barrel.c (bow_barrel_keep_top_words_by_infogain): Improve verbosity and reduce number of times printed. * lex-simple.c (bow_lexer_simple_open_text_fp): When checking to see if we should realloc to increase the DOCUMENT buffer size, make sure we leave room for the terminating '\0' that we'll add later in the function! (Wow! 
This was a wild bug that's been around for a while, but only recently caused occasional crashes. The crashes were in totally unrelated functions in int4str.c. The GDB `watch' command came to the rescue!) * scan.c (bow_scan_fp_for_string): Make it ignore Carriage-Return '\r' characters, so we can reliably scan for MIME header separators. * barrel.c (bow_barrel_keep_top_words_by_infogain): Make it more efficient by using qsort(). * rainbow.c (method): Initialize to -1. (DEFAULT_METHOD): New macro, equal to bow_method_naivebayes. (rainbow_index): Use new macro. (rainbow_query): Cause `-m' to have an effect here. (rainbow_test): Likewise. (rainbow_test_files): Likewise. (rainbow_print_usage): Rearrange flags to reflect new contexts in which -T, -m, and -U are valid. (main): Don't use the length of argv[0] to determine value of BOW_VERBOSITY_USE_BACKSPACE. Tue Jan 14 09:35:20 1997 Andrew McCallum * prind.m: Comment fixes. * rainbow.c (test_percentage): Initialize it to 0, not 30. (DEFAULT_TEST_PERCENTAGE): New macro, equal to 30. (rainbow_test): Set TEST_PERCENTAGE to DEFAULT_TEST_PERCENTAGE if it's zero. Set RAINBOW_CLASS_BARREL->METHOD to METHOD so we can give the -m option when using -t. (rainbow_test_files): Use TEST_PERCENTAGE and NUM_TEST_DOCS to determine how many training examples to ignore. Set RAINBOW_CLASS_BARREL->METHOD to METHOD so we can give the -m option when using -t. * dv.c (bow_dv_add_di_count_weight): Add parens around MAXSHORT, which seems to be needed on SunOS. * dv.c (_bow_dv_index_for_di): Reverse direction of for()-loop that scoots document entries up to make room! (Reported by Kamal Nigam.) * wi2dvf.c (bow_wi2dvf_dv): Gracefully handle arithmetic overflow in COUNT. Print a warning the first time it happens. * rainbow.c (rainbow_test_files): Close the FP in the nested function TEST_FILE! Append "/" to the DIR. Use FILENAME_TO_CLASSNAME when setting CURRENT_CLASS. (main): Add 'x' to the getopt call.
* configure.in: Look for perl5 before looking for perl. Mon Jan 13 09:51:32 1997 Andrew McCallum * rainbow.c (rainbow_test_files): New function. (main): Added local variable WHAT_DOING; use it. Add command-line option -x; call rainbow_test_files. Add command-line option -b. * lex-html.c: Use bow_verbose, instead of bow_quiet to print message about unterminated `<'. * docnames.c (bow_map_filenames_from_dir): Make it work even when DIRNAME is actually a filename. Sun Jan 12 12:37:32 1997 Andrew McCallum * rainbow.c (filename_to_classname): Make it work even when there isn't a `/' in the FILENAME. (rainbow_index): Use filename_to_classname(). * install.texi: Mention the need for GNU make. Add missing @end enumerate. * Makefile.in (default): Make it depend on all the DEMO_EXECUTABLES and the PERL_RUNNABLE_FILES, instead of just rainbow. * libbow.h: Update copyright. * primes.c (_bow_nextprime): Replace bzero by memset, for SunOS. * prind.c (_bow_barrel_set_prind_weights): Remove warning about old do-nothing loop. Remove the loop. * rainbow.c (rainbow_query): Set NUM_HITS_TO_SHOW equal to the number of classes, instead of just 2. Simplify output so it is more machine readable. (main): Require a length of 39, not 10, for argv[0] in order to turn off BOW_VERBOSITY_USE_BACKSPACE. (This was a hack so we don't get a lot of \b's inside gdb inside emacs.) Fix the getopt string to include a `:' after `v'. * Makefile.in (snapshot): cvs tag the repository. * rainbow.c: Deal with systems that don't have getopt.h. * weight.c (_bow_add_to_normalizer_total): Add case for BOW_METHOD_PRIND. (_bow_total_to_normalizer): Likewise. Sat Jan 11 17:57:16 1997 Andrew McCallum * arrow.c: Fix typo in last change. Fri Jan 10 11:03:51 1997 Andrew McCallum * configure.in: Look for getopt.h. * arrow.c: Deal with systems that don't have getopt.h. * rainbow-stats.pl (overall_accuracy): Print stderr for both verbosity levels! 
Thu Jan 9 11:46:41 1997 Andrew McCallum * rainbow-stats.pl (overall_accuracy): Print standard error, not standard deviation. Wed Jan 8 11:19:13 1997 Andrew McCallum * info_gain.c (bow_infogain_per_wi_new): Assert info gain is >= 0, not > 0. * barrel.c (bow_barrel_keep_top_words_by_infogain): Likewise. * rainbow.c (rainbow_index): Rearrange REUSE_ARCHIVED_BARREL_COUNTS logic so it works now. * vpc.c (bow_barrel_new_vpc): Don't assert DV, just continue if it's NULL. Tue Jan 7 09:45:54 1997 Andrew McCallum * libbow.h: Declare new barrel info gain function. Mon Jan 6 10:28:17 1997 Andrew McCallum * barrel.c (bow_barrel_keep_top_words_by_infogain): New function. * lex-gram.c (bow_lexer_gram_open_text_fp): Return NULL if LEX is NULL. * wi2dvf.c (bow_wi2dvf_remove_wi): New function. (bow_wi2dvf_write): Use new SEEK_START convention, in which it is -1 when DV is NULL; previously, when DV was NULL, it was equal to the previous SEEK_START. (bow_wi2dvf_new_from_data_fp): Likewise. * libbow.h: Declare new wi2dvf function. * rainbow.c, weight.c, vpc.c, score.c, libbow.h: Separate PrTFIDF from PrInd (Fuhr's Probabilistic Indexing). * prind.c: New file, for Fuhr's Probabilistic Indexing method. * rainbow.c (rainbow_index): Prune words by info gain in barrel, not in the word vocabulary, so that `-L' can work properly. Fri Jan 3 12:56:53 1997 Andrew McCallum * rainbow.c (main): Add the -L option, for turning off lexing the text files, and instead using the word counts in the archived barrel. (rainbow_print_usage): Likewise. (reuse_archived_barrel_counts): New global variable, controlling this. * lex-html.c (bow_lexer_html_get_word): Change type of argument SELF to match BOW_LEXER. * rainbow.c (main): Add the -s option, for turning off use of the stoplist. (rainbow_print_usage): Likewise. * rainbow.c (main): Add the -U option, for turning off uniform priors in PrTFIDF. (rainbow_print_usage): Likewise. 
* rainbow-stats.pl: Add test of $#ARGV to the `-s' test, so it actually works the way it's supposed to. Wed Jan 1 16:54:28 1997 Andrew McCallum * lex-html.c (bow_lexer_html_get_raw_word): Print warning when we find an unterminated open bracket `<'. Verbosify about close bracket warning with priority of BOW_VERBOSE, not BOW_PROGRESS. `rainbow -i -H -S' now seems to be working. * rainbow.c (rainbow_underlying_lexer): New global variable. (rainbow_html_lexer): New global variable. (rainbow_print_usage): Overhauled to accurately describe the valid arguments. (main): Rearrange and clean up argument handling. * libbow.h: Change type of argument SELF in BOW_LEXER_SIMPLE word-getting subfunctions. * lex-simple.c (bow_lexer_simple_open_text_fp): Deal with EOF in FP. Deal with zero-length documents. After we find END_PATTERN, move the DOCUMENT_POINTER back to the beginning of the END_PATTERN. (bow_lexer_simple_postprocess_word): Change type of SELF from BOW_LEXER to BOW_LEXER_SIMPLE. (old_bow_lexer_simple_get_word): Old, unused function removed. * lex-html.c (bow_lexer_html_get_raw_word): Keep a count of the HTML bracket nestings, instead of keeping track as a boolean. (bow_lexer_html_get_word): Postprocess word using the underlying lexer from SELF, not SELF itself. Tue Dec 31 12:36:21 1996 Andrew McCallum * int4word.c (bow_num_words): If WORD_MAP has not yet been created, return 0, instead of raising an error. * lex-html.c (bow_lexer_html_get_raw_word): Look for end by comparing to 0, not EOF. Fix termination condition of true-to-start loop. Change type of SELF to BOW_LEXER_SIMPLE from BOW_LEXER. (bow_lexer_html_get_word): Change type of SELF to BOW_LEXER_INDIRECT from BOW_LEXER. * lex-simple.c (bow_lexer_simple_get_raw_word): Look for end by comparing to 0, not EOF! * libbow.h (bow_str2method): Add "tfidf" as a synonym for tfidf_log_occur. * rainbow-stats.pl: Now the `-s' argument causes it to print only accuracy average and standard deviation.
(verbosity): New variable. Mon Dec 30 15:34:25 1996 Andrew McCallum * wi2dvf.c (bow_wi2dvf_add_di_text_fp): Loop over all documents (LEX's) in the file. * int4word.c (bow_words_add_occurrences_from_text_dir): Likewise. * wv.c (bow_wv_new_from_lex): New function. (bow_wv_new_from_text_fp): Use it. Handle NULL lex. * libbow.h: Declare new WV function. * lex-simple.c: Remove the N-gram lexer. * lex-gram.c, lex-indirect.c, lex-html.c: New files. * Makefile.in (LIBBOW_C_FILES): Added lex-gram.c, lex-html.c, lex-indirect.c. * libbow.h: Declare new lexer functions, types and variables. Sun Dec 29 13:04:17 1996 Andrew McCallum * lex-simple.c: Make all the instances of BOW_LEXER use a NULL DOCUMENT_END_PATTERN. (bow_lexer_simple_open_text_fp): Instead of scanning the FP twice, have it fill and grow the document buffer as it reads the FP for the first time. (This now seems to work on STDIN, although I haven't tried non-NULL DOCUMENT_END_PATTERN's with it; I'm not sure if FSEEK works on STDIN.) * libbow.h (bow_lexer): Comment the start and end patterns. * scan.c (bow_scan_fp_for_string): If STRING is the empty string, return immediately instead of scanning to EOF. The NULL string still scans to EOF. Fri Dec 27 20:00:30 1996 Andrew McCallum The changes for the new lexer. It now seems to be working. * libbow.h (bow_lex): New type, replacing BOW_PARSE. (bow_lexer): New type, replacing BOW_PARSER. (bow_lexer_simple): New type. New lexers based on this. (bow_lex_gram): New type. (bow_lexer_gram): New type. (bow_default_lexer): Renamed from BOW_DEFAULT_PARSER. (bow_stem_porter): Renamed from BOW_STEM. (bow_isalpha): New function declaration. (bow_isgraph): Likewise. * Makefile.in (LIBBOW_C_FILES): Remove p-alpha.c, p-alonly.c, p-gram.c, p-white.c. Add lex-simple.c. (DEMO_C_FILES): Remove robin.c. * defparser.c (bow_default_lexer): Renamed from BOW_DEFAULT_PARSER. * int4word.c (bow_words_add_occurrences_from_text_dir): Use new lexer instead of old parser.
* wi2dvf.c (bow_wi2dvf_add_di_text_fp): Likewise. * wv.c (bow_wv_new_from_text_fp): Likewise. * rainbow.c (rainbow_lexer): New global variable. (main): Use new lexer instead of old parser. Using BOW_ALPHA_LEXER as the underlying lexer instead of the old BOW_ALPHA_ONLY_PARSER. * stem.c (bow_stem_porter): Renamed from BOW_STEM. * lex-simple.c: New file. * scan.c (bow_scan_fp_for_string): If STRING is NULL or zero-length, then instead of immediately returning zero, scan through the FP until EOF. Thu Dec 26 12:20:44 1996 Andrew McCallum The last version before the many `lexer' changes. * int4word.c (bow_words_write): Write the WORD_MAP_COUNTS also. (bow_words_read_from_fp): Read and create them. * dv.c (bow_dv_add_di_count_weight): Assert that the new count is greater than zero. * prtfidf.c (_bow_barrel_set_prtfidf_weights): Assert that DV->IDF is greater than zero. Tue Dec 24 17:36:09 1996 Andrew McCallum * rainbow-stats.pl (calculate_accuracy): Use printf and %g instead of print. (overall_accuracy): Calculate and print standard deviation also. * rainbow.c (main): Use BOW_GRAM_PARSER_PARSER. * p-gram.c (bow_gram_parser_parser): New global variable. (bow_gram_parser_open_text_fp): Set it to BOW_DEFAULT_PARSER if it's NULL. Use it. (bow_gram_parser_close): Use it. (bow_gram_parser_get_word): Likewise. * libbow.h: Declare BOW_GRAM_PARSER_PARSER. * prtfidf.c (bow_prtfidf_uniform_priors): New global variable. Default is to use *uniform* class prior probabilities. (_bow_barrel_set_prtfidf_weights): Don't set the DV->IDF here, we'll use its current value later. (_bow_score_prtfidf_from_wv): Move the test for !DV. Pay attention to BOW_PRTFIDF_UNIFORM_PRIORS, and do the right thing. Sun Dec 22 13:35:03 1996 Andrew McCallum * .cvsignore: Added executables arrow, robin, rainbow-stats. * Makefile.in (INSTALL_FILES): New variable. (install): Use it. Fix removing of old executables. Install Perl files. * libbow.h (bow_str2method): New macro. 
(bow_words_keep_top_by_infogain): Declare function. * int4word.c: Add comment. * rainbow.c (rainbow_index): Remove words with occurrences less than X even if NUM_TOP_WORDS_TO_KEEP is non-zero. (main): New command-line argument `-R'. Use new bow_str2method(). * stoplist.c: Turn back on the builtin stoplist. Fri Dec 20 15:53:50 1996 Andrew McCallum * prtfidf.c: New file. Tue Dec 17 18:19:55 1996 Andrew McCallum * rainbow.c (main): Added prtfidf for `-m'. * score.c (bow_get_best_matches): Do the right thing for prtfidf, call _bow_score_prtfidf_from_wv. * stoplist.c (init_stopwords): Temporarily turn off the builtin stoplist, for use with the demo data. Yipes, this needs to be turned back on! * vpc.c (bow_barrel_new_vpc): Do the right thing for prtfidf; treat it like naivebayes. (bow_barrel_new_vpc_with_weights): Likewise. * weight.c (bow_barrel_set_weights): Call _bow_barrel_set_prtfidf_weights when appropriate. * Makefile.in (LIBBOW_C_FILES): Added prtfidf.c. Mon Dec 16 13:11:59 1996 Andrew McCallum * arrow.c (arrow_unarchive): Add verbosification. * info_gain.c (bow_infogain_per_wi_new): Fix verbosification. * info_gain.c (bow_infogain_per_wi_new): Add verbosifying. * rainbow.c (num_top_words_to_keep): Set to zero as a default. (rainbow_index): Make it possible to call both occurrence pruning and infogain pruning. (main): New command-line argument `-m'. * vpc.c (bow_barrel_new_vpc_with_weights): Create the VPC barrel, and then normalize the weights, otherwise we get -1 normalizers! * score.c (bow_get_best_matches): Delete more leftover naivebayes code. Assert that the normalizer is greater than 1. * rainbow.c (num_top_words_to_keep): New global variable set from command line. (rainbow_index): New nest function DO_INDEXING. Use it. Add term pruning according to information gain. (main): New command line argument `-T' to set num top words. * robin.c (robin_index): Do the right thing when WI is -1. * wi2dvf.c (bow_wi2dvf_add_di_text_fp): Likewise. 
* wv.c (bow_wv_new_from_text_fp): Likewise. * int4word.c (bow_words_keep_top_by_infogain): Implemented. (bow_words_add_occurrences_from_text_dir): Do the right thing when WI is -1. * info_gain.c (bow_infogain_per_wi_new): Set info gain to 0 when the DV for that word is NULL. * barrel.c (bow_barrel_new): Add new argument. Separate capacities for the cdocs array and the wi2dvf. * libbow.h: Declare new argument in bow_barrel_new. * arrow.c (arrow_index): Use new extra argument to barrel_new. * vpc.c (bow_barrel_new_vpc): Use new extra argument to barrel_new. * naivebayes.c (_bow_score_naivebayes_from_wv): Removed unused local variable. Wed Dec 11 15:19:44 1996 Andrew McCallum * Makefile.in (diff): Ignore the non-zero exit status from `diff'. * Makefile.in (dist): Call cvs rtag. (diff): New target. (clean): Delete *.info and *.dvi. (maintainer-clean): Delete $(PERL_RUNNABLE_FILES), configure, README, and INSTALL. * int4word.c (bow_words_keep_top_by_infogain): New function; not yet implemented. * naivebayes.c (_bow_score_naivebayes_from_wv): Incorporate P(w|C) for all words in query document, not just those in the DV. * rainbow.c: Added more comments. (rainbow_wi2dvf_sum_classes): Function removed. * Version (BOW_MINOR_VERSION): Version 0.5. * libbow.h (BOW_MINOR_VERSION): Likewise. This version given to Kamal. Tue Dec 10 20:22:29 1996 Andrew McCallum * rainbow-stats.pl: New file. Changed from Sean's version to include scientific notation in number regular expression. Naive-Bayes code runs without crashing, but it provides horrible results on the CIA type data. Average accuracy of 7%. It almost always chooses Defense_Forces. * naivebayes.c (_bow_barrel_set_naivebayes_weights): Rewrite from scratch, avoiding the use of heaps. * libbow.h: Include , so we get PATH_MAX. * naivebayes.c: New file. * Makefile.in (LIBBOW_C_FILES): Added naivebayes.c. * weight.c: Remove the NaiveBayes code to naivebayes.c. * score.c (bow_get_best_matches): Likewise.
* vpc.c (bow_barrel_new_vpc): Remove the NaiveBayes prior-setting to naivebayes.c. * rainbow.c: Remove the commented-out pre-vpc code. Change the default method to naivebayes. Mon Dec 9 10:11:05 1996 Andrew McCallum CIA type data shows performance improvement from 1-grams to 1/2-grams: 79% to 68% accuracy. * wi2dvf.c (bow_wi2dvf_dv): Fix assertion for when doing the last WI. * vpc.c (bow_barrel_new_vpc): Verbosify and fix off-by-one error in class index handling. * rainbow.c (rainbow_classnames): New global variable. (rainbow_unarchive): Set it. (rainbow_index): Verbosify while we read files for word pruning. * libbow.h (PATH_MAX): Avoid warning in surrounding #if. (bow_fopen): Use perror() as well as bow_error. * int4word.c (bow_words_add_occurrences_from_text_dir): Keep track of the text file count, and verbosify. * barrel.c (bow_barrel_new): Create the new wi2dvf with bow_num_words(), not CAPACITY. (_bow_barrel_cdoc_free): Only free FILENAME if it's non-NULL. * array.c (bow_array_entry_at_index): Fix off-by-one error in assertion. * rainbow.c: Make it work with new vpc function, but old code is still there commented-out. * libbow.h: Declare new vpc and infogain functions. * info_gain.c: Comment new functions. * score.c (bow_get_best_matches): Add code to do NaiveBayes; thanks to Dunja, who helped. * vpc.c (bow_barrel_new_vpc): Totally rewritten. Now simpler and faster. Don't create a dv_heap, just go through the wi2dvf by words. (bow_barrel_new_vpc_with_weights): New function. * weight.c (_bow_barrel_set_weights_naivebayes): Renamed from _bow_barrel_set_weights_sans_idf. Verify that the class priors are set. (bow_barrel_set_weights): Use new function name. * split.c (drand48, srand48) [__sun__]: Add prototypes. Fri Dec 6 17:57:46 1996 Andrew McCallum * wi2dvf.c (bow_wi2dvf_print_stats): Don't use //-style comments. * weight.c (bow_barrel_set_weight_normalizers): Likewise. * libbow.h: Add inclusions and declarations needed for SunOS; thanks to Sean. 
* rainbow.c (prune_words_with_occurrences_less_than): New global variable. (rainbow_index): Use it. (rainbow_query): Set and normalize the QUERY_WV weights! (rainbow_test): Likewise. * dv.c (bow_dv_default_capacity): Value changed from 4 to 2, in an effort to reduce memory use. * email.c (bow_email_get_replyid): Don't insist that the opening `<' is on the same line as "In-Reply-To:". * int4word.c (word_map_counts, word_map_counts_size): New static variables. (bow_word2int_do_not_add): New static variable. (_bow_int4word_initialize): New function. (bow_word2int): Use it. Pay attention to bow_word2int_do_not_add. (bow_words_set_map): New function. (bow_word2int_add_occurrence): New function. (bow_words_occurrences_for_wi): New function. (bow_words_remove_occurrences_less_than): New function. (bow_words_add_occurrences_from_text_dir): New function. * libbow.h: Declare new bow_words_ functions. * robin.c (robin_index): Use new function bow_word2int_add_occurrence(). * weight.c (bow_barrel_set_weight_normalizers): Free the heap before returning! * wi2dvf.c (bow_wi2dvf_add_di_text_fp): Use new function bow_word2int_add_occurrence(). * wv.c (bow_wv_new): Initialize normalizer to 1. (bow_wv_new_from_text_fp): Likewise. (bow_wv_new_from_text_fp): Use new bow_word2int_add_occurrence(). Thu Dec 5 09:49:46 1996 Andrew McCallum * score.c (bow_get_best_matches): Make sure QUERY_WV->NORMALIZER is non-zero. * weight.c (bow_wv_set_weight_normalizer): Make sure TOTAL is non-zero. * rainbow.c: Include for DEC Alpha's. * arrow.c: Likewise. Wed Dec 4 10:44:52 1996 Andrew McCallum * Makefile.in (PERL): New variable. (LIBBOW_C_FILES): Added p-gram.c. (PERL_FILES): New variable. (PERL_RUNNABLE_FILES): New variable. (DIST_FILES): Added PERL_FILES. (all): Add dependency on PERL_RUNNABLE_FILES. (PERL_RUNNABLE_FILES): New rule. * configure.in: Look for perl in path. * rainbow.c (infogain_words_to_print): New global variable, set by command-line arguments.
(main): Set BOW_DEFAULT_PARSER to BOW_GRAM_PARSER; set BOW_GRAM_PARSER_GRAM_SIZE to 1. New command line options, -g, -I, -h. Call BOW_INFOGAIN_PER_WI_PRINT. * info_gain.c (bow_infogain_per_wi_new): New function. (bow_infogain_per_wi_print): New function. (bow_barrel_scale_by_info_gain): Use new function above. * p-gram.c: New file. * libbow.h (bow_parser_skip_net_header): Declare new global variable. (bow_gram_parser): Declare new parser struct. (bow_gram_parser_gram_size): Declare new global variable. * defparser.c (bow_parser_skip_net_header): Define and initialize to 0. * p.inc (BOW_P_OPEN_NAME): If BOW_PARSER_SKIP_NET_HEADER is non-zero, scan into the FP past the first "\n\n", in order to skip over the email/news header. Tue Dec 3 10:20:11 1996 Andrew McCallum * wv.c (bow_wv_count_for_wi): Use bow_wv_entry_for_wi() instead of duplicating code. * wi2dvf.c (bow_wi2dvf_new): Initialize the FP to NULL! (bow_wi2dvf_dv): Assert that WI isn't larger than the WI2DVF->SIZE. Assert that IDF isn't NaN; twice. * split.c (bow_test_new_heap): Drastically simplify. (bow_test_next_wv): Free the old *WV if it is non-NULL. Use bow_wv_new() instead of creating it with malloc by hand. When we've reached the end of the heap, free the *WV. * score.c (bow_get_best_matches): Make CURRENT_SCORE a double instead of a float. Assert that IDF isn't NaN. Don't normalize the query WV. Most important: avoid a memory leak by freeing the HEAP when we are done with it! * rainbow.c (rainbow_wi2dvf_sum_classes): Set the class IDF from the doc IDF. Still add in the count and weight, even if the weight is zero. This means the wi2dvf will expand to the proper size so we can meaningfully get DV's from it. (rainbow_set_weights): Don't scale by info gain. (rainbow_test): Initialize the QUERY_WV to NULL so bow_test_next_wv() will know not to free an uninitialized value. * heap.c (bow_dv_heap_free): New function. (bow_make_dv_heap_from_wv): Add assertion checking for IDF NaN.
* dv.c (bow_dv_new_from_data_fp): Add comment about FP assertion. * weight.c: Assert that IDF is not NaN. Don't print progress verbosity every time through the loop---it's slowing us down---only print it every 10 times through the loop. * dv.c (bow_dv_new): Initialize the IDF to zero! (bow_dv_write_size): Include the IDF size in the return value. (bow_dv_write): Write the IDF. (bow_dv_new_from_data_fp): Read the IDF. * rainbow.c: Keep two barrels: one for classes, one for documents. (num_trials, test_percentage, method): New global variables set by command-line switches. (rainbow_archive): Deal with both barrels. (rainbow_unarchive): Likewise. (rainbow_set_weights): New function... (rainbow_wi2dvf_sum_classes): ...using code pulled from here. (filename_to_classname): New function. (rainbow_test): New function. (main): Add new command line switches -t, -p. Call rainbow_test(). * arrow.c: Use new weight normalization functions. * split.c: Renamed functions to all begin with `bow_test_'. Use argument `barrel' instead of `cdoc' and `wi2dvf'. * weight.c: Use bow_method instead of bow_idf_type and bow_normalize_type. All functions changed. * score.c (bow_get_best_matches): Rename some variables. Add mechanics of NaiveBayes. Normalize query vector. Normalize non-NaiveBayes outside the loop. * barrel.c (bow_barrel_new): Fix initialization of METHOD. (bow_barrel_add_from_text_dir): Initialize the PRIOR. (_bow_barrel_cdoc_write): Write the PRIOR. (_bow_barrel_cdoc_read): Read the PRIOR. (bow_barrel_new_from_data_fp): Read the METHOD properly. * libbow.h: Remove types as arguments to some weight functions. Rename the test/train split functions. (bow_cdoc): Added member PRIOR. (bow_barrel): Added member METHOD. (bow_method): New enum. (bow_idf_type): Removed. (bow_normalize_type): Removed. * Makefile.in (LIBBOW_C_FILES): Added split.c. * barrel.c (bow_barrel_new): Set RET->METHOD to default of BOW_METHOD_TFIDF. (bow_barrel_new_from_data_fp): Read METHOD. 
(bow_barrel_write): Write METHOD. * weight.c (_bow_add_to_normalizer_total): New function. (_bow_total_to_normalizer): New function. (bow_barrel_set_weight_normalizers): Use them. (bow_wv_set_weights): Function moved here from wv.c. (bow_wv_set_weight_normalizer): Likewise. * wv.c: Weight and normalizer functions moved to weight.c. * libbow.h: Move the WV weight-setting functions next to the barrel weight-setting functions. (bow_cdoc): Rename member LENGTH to NORMALIZER, for clarity. (bow_barrel_set_weight_normalizers): Renamed from bow_barrel_normalize_weights, since it doesn't actually change the weight values. * wv.c: Include (sqrtf): New macro. (bow_wv_set_normalizer): Renamed from bow_wv_set_norm(). New argument TYPE. Obey new argument. (bow_wv_set_weights): New argument TYPE. Obey it. Don't call the normalizer function. (bow_wv_write): Use new name WV->NORMALIZER. (bow_wv_new_from_data_fp): Likewise. * score.c (bow_get_best_matches): Use renamed NORMALIZER member. * barrel.c (_bow_barrel_cdoc_write): Likewise. (_bow_barrel_cdoc_read): Likewise. * libbow.h: Rename and add new arguments to WV functions. Mon Dec 2 13:14:10 1996 Andrew McCallum * arrow.c (arrow_unarchive): Don't close the barrel FP, because we still have yet to read the DV's from it! * barrel.c (bow_barrel_add_from_text_dir): Print warning if we end up finding more binary files than text files. * score.c: Some formatting and comment changes. * weight.c: Some comment and variable name changes. (_bow_add_to_idf): Renamed from bow_add_to_total. (_bow_barrel_set_weights_nb): New function for doing Naive Bayes. (bow_barrel_set_weights): Call it if necessary. * libbow.h: Declare new vpc function. (bow_idf_nb): New idf type. * Makefile.in (LIBBOW_C_FILES): Added vpc.c. * vpc.c (bow_barrel_new_vpc): Renamed from bow_barrel2vpc_barrel. Replace use of printf() with bow_verbosify(). Minor formatting changes. Mon Dec 2 13:09:10 1996 Sean Slattery * vpc.c: New file - implements vector per class models.
Basically, take a barrel and produce a vector per class barrel from it. Tue Nov 26 15:56:07 1996 Andrew McCallum * weight.c (_bow_add_to_total): Renamed to include a prefixing `_'. Declared `static inline'. (bow_barrel_set_weights): Overhauled and simplified. I'm not sure I haven't broken it, though. Previously some of the `if()else' clauses seemed contradictory to me. * score.c (bow_get_best_matches): Add comment about my perceived pending need for normalization of the query vector. * dv.c (_bow_dv_index_for_di): Add 1 to the DV length when it was zero! (bow_dv_add_di_count_weight): New function, replacing bow_dv_add_di_count. (bow_dv_add_di_weight): Function removed. * libbow.h: Declare new wi2dvf function, and remove old ones. * rainbow.c (rainbow_wi2dvf_sum_classes): Use new dv function. * wi2dvf.c (bow_wi2dvf_add_wi_di_count_weight): New function, replacing bow_wi2dvf_add_wi_di_count. Use new dv function. (bow_wi2dvf_add_di_wv): Use new dv function. (bow_wi2dvf_add_di_text_fp): Likewise. * Makefile.in (LIBBOW_C_FILES): Added scan.c; although this will be taken away once I change parsing to use strings and librx. * scan.c: New file. * heap.c (bow_make_dv_heap_from_wi2dvf): Add silly assert()ion. * email.c (bow_email_get_date): Don't cause error when Date isn't found, just return 0. * dv.c (_bow_dv_index_for_di): New function that captures the guts of preparing a spot to add a count or weight. (bow_dv_add_di_count): Use it. (bow_dv_add_di_weight): Use it. * info_gain.c (bow_entropy): Ensure COUNTS[i] isn't zero before calculating entropy. Mon Nov 25 11:51:35 1996 Sean Slattery * weight.c: Added support of bow_prtfidf weighting which gives an idf = sqrt(total occurrences/occurrences). (bow_barrel_set_weights): added code to calculate the total number of occurrences and changed idf calculations that had those pesky 1.0's with the real totals intended. Doing things this way ensures the weights don't go below 0. * score.c (bow_get_best_matches_euclidian): New function.
Gets best matches based on the euclidean distance between vectors instead of the cosine of the angle between them. * dv.c (bow_dv_add_di_weight): Made this function capable of updating weights that occur before the last element entered. It assumes the documents are in the list in ascending order of their indices. Should make this change to the bow_dv_add_di_count function as well, but this was the minimum I needed to get vector per class stuff done. Mon Nov 18 14:17:55 1996 Andrew McCallum * wv.c (bow_wv_set_norm): Initialize TOTAL to zero! It was uninitialized. (bow_wv_write): Write the NORM! (bow_wv_new_from_data_fp): Read it. (bow_wv_write_size): Adjusted for writing NORM. * stoplist.c (bow_stoplist_add_from_file): Screaming verbosify each word that's added. * heap.c (bow_make_dv_heap_from_wv): Get the DV using bow_wi2dvf_dv(), not by accessing the structure directly. Otherwise, we will not properly read in the DVF from disk. * weight.c (bow_barrel_set_weights): Likewise. * p.inc (BOW_P_GET_WORD_NAME): Also check if word is on the stoplist *after* stemming. Tue Nov 5 12:15:37 1996 Andrew McCallum * weight.c (bow_barrel_set_weights): Use the total number of documents instead of 1.0. Fri Nov 1 11:27:23 1996 Andrew McCallum * p.inc: Fix the handling of BOW_P_STOPLIST_CHECKER. * int4str.c (bow_int4str_new_from_fp): Make it work even for strings that contain spaces, (but not newlines). (bow_int4str_write): Make sure the strings don't contain newlines. Add generalizable parsing facilities. * libbow.h (bow_parse, bow_parser): New types. Add new parsing funcs. (bow_get_word): Function removed. Use new parsing facilities instead. * Makefile.in (DIST_FILES): Added p.inc. * Makefile.in (LIBBOW_C_FILES): Added p-alpha.c, p-alonly.c, p-white.c. * p.inc, p-alpha.c, p-alonly.c, p-white.c: New files. * Makefile.in (LIBBOW_C_FILES): Added defparser.c. Removed getword.c. * wi2dvf.c (bow_wi2dvf_add_di_text_fp): Use new parser.
* wv.c (bow_wv_new_from_text_fp): Use new parser. * arrow.c (arrow_index): Use new bow_barrel_add_from_text_dir function. Thu Oct 31 15:18:49 1996 Andrew McCallum * Version (BOW_MINOR_VERSION): Version 0.4. * libbow.h (BOW_MINOR_VERSION): Version 0.4. * rainbow.c: Add output filename feature. Use bow_idf_words, which unlike bow_idf_log_words, seems to work. (rainbow_index): Scale by information gain. * barrel.c (bow_barrel_add_from_text_dir): Add new EXCEPT_NAME argument. Deal with NULL EXCEPT_NAME. * libbow.h: Add new argument to barrel function. * weight.c (bow_barrel_set_weights): Add prefix and postfix verbosity strings. (bow_barrel_normalize_weights): Add verbosifying. * info_gain.c (bow_barrel_scale_by_info_gain): Add verbosifying. * barrel.c (bow_barrel_add_from_text_dir): Don't print the number of "binary files". * rainbow.c: Added some verbosifying. * rainbow.c: Don't close the rainbow_barrel fp. Set the weights in the right place. Put the indexing code in main(). Now running to completion. * dv.c (bow_dv_new_from_data_fp): Add new assertion that should help us catch closed FP's. * docnames.c (bow_map_verbosity_level): New global variable. (bow_map_filenames_from_dir): Use it. * barrel.c (bow_barrel_add_from_text_dir): Renamed from bow_barrel_new_from_text_dir. Don't create a new barrel, just add to a pre-existing one. * libbow.h: Declare renamed function. * libbow.h (bow_fwrite_string): Handle the NULL string for argument S. (bow_fread_string): Match bow_fwrite_string handling of NULL. Mon Oct 28 12:03:11 1996 Andrew McCallum * info_gain.c (bow_barrel_scale_by_info_gain): Renamed from bow_wi2dvf_scale_by_info_gain. * libbow.h: Rename info gain function to use `barrel'. * rainbow.c: Totally rewritten to be a document classifier. * wi2dvf.c (bow_wi2dvf_add_di_wv): Increase wi2dvf size with a MAX(), so we are guaranteed to be big enough. (bow_wi2dvf_add_wi_di_count): Likewise. (bow_wi2dvf_add_wi_di_weight): Likewise. 
(bow_wi2dvf_write): Incorporate initial seek position into calculations, in case we are writing to a file that already has other stuff at the beginning. * barrel.c (bow_barrel_free): New function. (bow_barrel_new_from_text_dir): Print shorter verbosity. * libbow.h: Declare bow_barrel_free(). * arrow.c (arrow_index): Set the weights. (main): Raise error if no text documents found. * libbow.h: Declare bow_barrel_new(), and fix typo. * wi2dvf.c (bow_wi2dvf_add_wi_di_weight): New function. * libbow.h: Declare new wi2dvf function. * dv.c (bow_dv_add_di_weight): New function. * libbow.h: Declare new dv function. * weight.c (bow_barrel_set_weights): Renamed from bow_wi2dvf_set_weights. (bow_barrel_normalize_weights): Renamed from bow_wi2dvf_normalize_weights. * libbow.h: Rename weight functions to use `barrel'. * barrel.c (bow_barrel_new_from_text_dir): Take new CLASS argument. Set the `class' of the new cdoc's accordingly. * libbow.h: Add new argument to barrel function. Fri Oct 25 13:05:16 1996 Andrew McCallum * arrow.c (main): Create the data directory if it doesn't exist already. * Version (BOW_MINOR_VERSION): Version 0.3. * sarray.c (bow_sarray_new_from_data_fp): Renamed from bow_sarray_new_from_fp. * libbow.h: Rename sarray function. * Makefile.in (version.texi): Use renamed BOW_ variables. (libbow.h): New target with rules that keep it up to date with ./Version. * libbow.h (BOW_MAJOR_VERSION): New macro. (BOW_MINOR_VERSION): New macro. (BOW_VERSION): New macro. * Version (BOW_MAJOR_VERSION): New variable. (BOW_MINOR_VERSION): New variable. (BOW_VERSION): Use them; renamed from LIBBOW_VERSION. * arrow.c: New file. * Makefile.in (DEMO_C_FILES): Added arrow.c. * barrel.c: New file. * libbow.h: Declare barrel archiving functions. * stoplist.c (bow_stoplist_add_from_file): Add a verbosify message. Wed Oct 23 16:45:46 1996 Andrew McCallum * libbow.h (bow_barrel): New type. 
Use it in all places where a WI2DVF and CDOCS were used together; several function arguments changed. * wi2dvf.c (bow_wi2dvf_new_from_text_dir): Function removed. Similar function is now in barrel.c. * weight.c (bow_wi2dvf_set_weights): Use bow_barrel. (bow_wi2dvf_normalize_weights): Likewise. * score.c (bow_get_best_matches): Use bow_barrel. * info_gain.c (bow_wi2dvf_scale_by_info_gain): Use new bow_barrel type. * Makefile.in (LIBBOW_C_FILES): Add barrel.c. * array.c (bow_array_new_from_data_fp): Renamed from bow_array_new_from_fp. * sarray.c (bow_sarray_new_from_fp): Use renamed function bow_array_new_from_data_fp. Tue Oct 22 14:03:38 1996 Andrew McCallum * email.c (_scan_fp_for_string): Make `\n' at the beginning of the search string match the beginning of the file. * error.c (_bow_error) [__linux__]: Call abort() instead of exit() because it lets us find ourselves in GDB. Still don't do it for non-Linux systems, because apparently on other systems there was a problem with flushing stderr when calling abort(). * docnames.c (bow_map_filenames_from_dir): Don't verbosify the directory names if we're not BOW_VERBOSITY_USE_BACKSPACE. * email.c: To several functions add new argument that negates test, or that insists on a search all on one line. (_bow_email_get_email_address): New function. (bow_email_get_sender): New function. (bow_email_get_recipient): New function. (bow_email_get_date): New function. * libbow.h: Declare new email functions. Mon Oct 21 12:08:45 1996 Andrew McCallum * libbow.h (bow_parse_news_headers): Add missing semi-colon to declaration. * info_gain.c (bow_entropy): Get the "document vector" with bow_wi2dvf_dv(), not by following the pointer directly. Otherwise, we won't properly read the DV in from the file, and may get inappropriate NULLs. * Makefile.in (LIBBOW_C_FILES): Add info_gain.c. * HACKING: Correct directions for checking out bow from CVS. Sat Oct 19 00:49:08 1996 Sean Slattery * news.c: Function for parsing news article headers. 
Useful for looking for crosspostings for multiple classifications. (bow_parse_news_headers): Added a getc to dump the first whitespace character after the : following the header (bow_headers2newsgroups): New function to grok the bow_sarray returned by bow_parse_news_headers and return a bow_array of strings corresponding to every newsgroup mentioned in the newsgroup line. * libbow.h: Added def'n for new function * libbow.h: Added bow_parse_news_headers Fri Oct 18 10:50:23 1996 Andrew McCallum * info_gain.c (log2f): #define it if ./configure determined that we don't have it. (bow_entropy): Use log2f instead of log2. (MIN): Macro removed. It's now in libbow.h. Fri Oct 18 21:32:42 1996 Sean Slattery * array.c (bow_array_append): Changed test from array->length > array->size to array->length >= array->size. When array->length = array->size, we've run out of space. (bow_array_init): Assigned array->free_func to free_func. Otherwise free_func is not initialised and the bow_array_free function will sometimes crash. Fri Oct 18 10:50:23 1996 Andrew McCallum * libbow.h: Declare new functions. (bow_wv): LENGTH entry renamed to NORM. * wv.c (bow_wv_set_norm): New function. (bow_wv_set_weights): New function. * error.c (_bow_error): Call exit(-1) instead of abort(). It makes a prettier error message on the console. * heap.c (bow_make_dv_heap_from_wi2dvf): Separate the index into words and index into the heap so that we can handle wi2dvf's that have some NULL "document vectors". * libbow.h (MIN): New macro. (MAX): New macro. (bow_verbosity_use_backspace): New global variable declaration. * error.c (bow_verbosity_use_backspace): New global variable. (bow_verbosify): Use it. * weight.c (MIN): Remove definition. It's now in libbow.h. * libbow.h (bow_wi2dvf_normalize_weights): Change from `normalise' to American spelling. The *.c file had already been changed. Fri Oct 18 01:15:57 1996 Sean Slattery * info_gain.c (bow_wi2dvf_scale_by_info_gain): New file, information gain routine.
(bow_entropy): Cast some of the arithmetic to floats - dividing one integer by another tends to go to 0 here. * libbow.h: added definition for above. * split.c: (bow_next_test_wv) Free heap when we've exhausted the test set (for tidiness) * heap.c (bow_make_dv_heap_from_wi2dvf): Changed malloc to bow_malloc (bow_make_dv_heap_from_wv): Changed malloc to bow_malloc Thu Oct 17 11:08:42 1996 Andrew McCallum * score.c (bow_get_best_matches): Add an assert()'ion that WI match the word index of our current location in the word vector. * libbow.h: Rename local variables from num_written to num_read where appropriate. * heap.c (bow_make_dv_heap_from_wv): Fix typo: continue when DV is NULL, not the other way around. * wi2dvf.c (bow_wi2dvf_write): Don't close the FP at the end! We didn't open it. (bow_wi2dvf_new_from_data_file): Don't close the FP, it will still be needed to read the DV's. * libbow.h (bow_wi2dvf_write_data_file, bow_wi2dvf_new_from_data_file): Re-add declarations for these functions. * heap.c (bow_make_dv_heap_from_wv): WV->LENGTH is not the number of entries in the word vector, it is the Euclidean length! Change all uses of WV->LENGTH to WV->NUM_ENTRIES. * libbow.h (bow_wv): Renamed element `length' to `total' in an attempt to choose a less confusing name. Other naming suggestions welcome. * array.c (bow_array_new_from_fp): Set the LENGTH of the new array; before it was uninitialized! * libbow.h (bow_fwrite_string): Properly calculate the number of characters written. (bow_fread_string): Likewise, and parenthesis indexing of S for proper termination. (bow_idf_type): Added `bow_idf' as prefix to enum members, and removed `total' from end. * weight.c: Use new bow_idf enum names. (bow_wi2dvf_set_weights): Handle the case in which a document vector in the WI2DVF is NULL. * libbow.h: Include * heap.c (bow_make_dv_heap_from_wv): Make it work even when not all the words in WV have document vectors in WI2DVF.
Keep separate indices into the word vector and into the heap. * int4str.c (HEADER_STRING): New macro. (bow_int4str_write): Write it to the FP. (bow_int4str_new_from_fp): Expect it from the FP. * array.c (HEADER_STRING): New macro. (bow_array_write): Write it to the FP. (bow_array_new_from_fp): Expect it from the FP. * bmalloc.c: Remove previous contents. Now get the functions directly from libbow.h. * io.c: Likewise. * Makefile.in (io.o bmalloc.o): Indicate that they now depend (completely) on libbow.h. * libbow.h (_BOW_MALLOC_INLINE_EXTERN): New macro for compiling these extern inline functions in library .o files. (_BOW_IO_INLINE_EXTERN): Likewise. (bow_fwrite*, bow_fread): Assert the return values. * int4docn.c (bow_docnames_write): Take FILE* argument instead of const char *. (bow_docnames_read_from_fp): Renamed from bow_docnames_read(), likewise as above. * libbow.h: Change argument types and function name for bow_docnames archiving. * docnames.c (bow_map_filenames_from_dir): Use renamed bow_verbosity_level enum. * libbow.h (bow_error): Don't print anything if bow_verbosity_level indicates bow_silent. Thu Oct 17 14:36:44 1996 Sean Slattery * split.c (bow_next_test_wv): Function now takes a pointer to a pointer to a bow_wv. It sets this to point to a pointer to the wv it creates and returns the integer document index to the test document described by this word vector. (bow_test_split): Fixed bug in counting that meant we sometimes ended up with fewer test docs than asked for. (bow_test_split): Random number generator is now seeded with time. * libbow.h: (bow_next_test_wv) Argument change as above. Wed Oct 16 08:35:45 1996 Andrew McCallum * libbow.h (bow_screaming): Renamed from bow_shutup_already. Commented all bow_verbosity_levels. * wi2dvf.c (bow_wi2dvf_write_data_file): Close the FP at the end! (bow_wi2dvf_new_from_data_file): Likewise. (bow_wi2dvf_new_from_data_fp): Don't assert feof(), because there may be multiple things written to one file.
* io.c (bow_fread_string): Add parenthesis in order to dereference string pointer properly. * libbow.h: Comment changes to #include lines. (bow_fopen): New macro. * wi2dvf.c (bow_wi2dvf_new_from_data_fp): Renamed from bow_wi2dvf_new_from_fp. All callers changed. * libbow.h: Renamed function. * Makefile.in (LIBBOW_C_FILES): Added heap.c. * io.c (bow_fwrite_string): New function from libbow.h. (bow_fread_string): Likewise. Wed Oct 16 14:03:07 1996 Sean Slattery * weight.c: Checked for case total == 0 which can occur if no documents in the model had this word. Without this check, we get a floating point error when trying to divide by total * score.c: (bow_get_best_matches) Added support for a bow_array of cdocs. * weight.c: Messed up loop test on outer loop - Reset it to the max_wi which Andrew changed it to before. * libbow.h: Added defs for functions in split.c * split.c: New file with functions for dealing with test sets. * weight.c: Added bow_array *cdoc arguments to bow_wi2dvf_set_weights (so we can only do docs in the model), and to bow_wi2dvf_normalize_weights where we only calculate the length of docs in the model and we store the length in the corresponding cdoc structure. * libbow.h: Added include of string.h to stop compiler complaint on alpha Wed Oct 16 08:35:45 1996 Andrew McCallum * int4word.c (bow_words_write): Now takes FILE* argument instead of filename. (bow_words_read_from_fp): Renamed from bow_words_read_from_file, and likewise as above. * libbow.h (bow_words_read_from_fp): Renamed from bow_words_read. * libbow.h: Change argument types of bow_words_write function. (bow_error): Enclose expansion in parenthesis, so that it parses properly when put inside an `else' statement without brackets. * wi2dvf.c (bow_wi2dvf_write): New function. (bow_wi2dvf_write_data_file): Use it. This function is now deprecated. (bow_wi2dvf_new_from_fp): New function. (bow_wi2dvf_new_from_data_file): Use it. This function deprecated. (bow_wi2dvf_free): New function. 
* libbow.h: Declare new functions. Remove deprecated functions. * sarray.c (bow_sarray_write): New function. (bow_sarray_new_from_fp): New function. * libbow.h: Declare new functions. * wv.c (bow_wv_new): New function. (bow_wv_write_size): New function. (bow_wv_write): New function. (bow_wv_new_from_data_fp): New function. * libbow.h: Declare new functions. * weight.c (bow_wi2dvf_normalize_weights): Use renamed variable wv_length. * array.c (bow_array_write): New function. (bow_array_new_from_fp): New function. * libbow.h: Declare new functions. * int4str.c (bow_int4str_write): Make second argument a FILE* instead of a filename. (bow_int4str_new_from_fp): New function. * libbow.h: Declare new function. Update argument type. Tue Oct 15 10:07:42 1996 Andrew McCallum * wv.c (bow_wv_entry_for_wi): New function. * libbow.h: Declare new function. * libbow.h: Update for function name changes. (bow_class): New structure. * weight.c (bow_wi2dvf_normalize_weights): Renamed from bow_normalize_word_vectors. Minor format, comments and variable name changes. * configure.in: Check for existence of log2f() and sqrtf() functions. * weight.c (bow_wi2dvf_set_weights): Renamed from bow_assign_tfidf_weights because it is specific to wi2dvf structures, and we could imagine having a di2wvf structure in the future, and because we could imagine non-TFIDF weight-setting schemes. Don't loop over all word indices up to bow_num_words(), only loop up to the min of that and size of WI2DVF. Raise an error if there is an unrecognized TYPE. Fix bow_verbosify() call. Mon Oct 14 16:17:26 1996 Andrew McCallum * libbow.h: Indentation and comment fixes. * rainbow.c (main): Don't exit() prematurely. Actually write the data file and read it back in again. * wi2dvf.c (bow_wi2dvf_write_data_file): Use sizeof(int) instead of sizeof(long) since it better matches reality. * dv.c (bow_dv_write_size): Sum short's, not int's, or else we'll lie about the results of bow_dv_write.
* getword.c (bow_get_word) [NON_ALPHA_IN_WORD]: New macro selecting new code that will reject a word if it contains any non-alphabetic characters. Current default is to include this code. * bitvec.c (bow_bitvec_new): Properly initialize all values to 0, not to 1. Fri Oct 11 17:37:28 1996 Andrew McCallum * bitvec.c: Finish and debug implementation. * libbow.h: Add bow_bitvec declarations. * Makefile.in (LIBBOW_C_FILES): Added bitvec.c. * bitvec.c: New file. Fri Oct 11 17:14:12 1996 Sean Slattery * libbow.h: Resolved a conflict in bow_cdoc / bow_doc definition. Thu Oct 10 09:44:55 1996 Andrew McCallum * stoplist.c (bow_stoplist_add_from_file): Don't raise an error if we can't open the file. This way, we can simply call the function with several "guessed" filenames. * libbow.h: Update comment for stoplist function. * getword.c (bow_get_word): Delineate words by space characters and non-printable characters, not by non-alphabetic characters (but still reject words with "too many" digits). This is an effort to return entire email addresses and URLs as single words. * stoplist.c: Totally re-written using a bow_int4str. (bow_stoplist_present): Renamed from bow_on_stoplist. (bow_stoplist_add_from_file): New function. * libbow.h: Declare new stoplist functions. * Makefile.in (LIBBOW_C_FILES): Added stopwords.c. * getword.c (bow_get_word): Use renamed stoplist function. * email.c (bow_email_get_receivedid): New function. * libbow.h: Declare new email functions. * rainbow.c (main): Use new function name bow_wi2dvf_write_data_file(). * Makefile.in ($(DEMO_EXECUTABLES):): Depend on all the $(DEMO_O_FILES). * int4word.c (bow_num_words): Print error if WORD_MAP has not yet been initialized. * docnames.c (bow_map_filenames_from_dir): Don't forget to copy the CWD and the D_NAME into the FILENAME! * wv.c (bow_wv_count_for_wi): Return 0 if WV is NULL. * Makefile.in (LIBBOW_C_FILES): Added email.c. (DEMO_EXECUTABLES:): Changed rule to make $*.o separately. * email.c: New file.
* libbow.h: Declared new email functions. * wi2dvf.c (bow_wi2dvf_write_data_file): Renamed from bow_wi2dvf_write(). * libbow.h: Rename function declaration. * wv.c (bow_wv_count_for_wi): New function. * libbow.h: Declare new function. * sarray.c (bow_sarray_index_at_keystr): New function. * libbow.h (bow_sarray_index_at_keystr): Declare new function. * sarray.c: New file. * docs.c: Old file, no longer used. * Makefile.in (LIBBOW_C_FILES): Add sarray.c. Remove docs.c. Temporarily remove heap.c because it hasn't been checked into CVS, and I don't have access to it. * libbow.h (bow_sarray): New typedef, and new function declarations. (bow_cdoc): Renamed from bow_doc. SEEK_START and SEEK_LENGTH elements removed. Many users will need to define their own "document entries" with different elements; this is just one example typically used for classification. (bow_docs): Typedef removed. (bow_cdocs): New macro, a bow_array of cdocs. Also add macros for functions. (bow_wi2dvf_add_di_text_fp): Declare new function. * int4str.c (bow_int4str_init): New function. (bow_int4str_new): Use it. * array.c (bow_array_default_capacity): Renamed from bow_array_default_size. (bow_array_init): Use new name. (bow_array_append): Renamed from bow_array_add_at_index, since the user really doesn't have a choice of index anyway. No INDEX argument now. * wi2dvf.c (bow_wi2dvf_add_di_text_fp): New function. (bow_wi2dvf_new_from_text_dir): Use it. Wed Oct 9 15:53:52 1996 Andrew McCallum * array.c (bow_array_add_at_index): Include ENTRY_SIZE in calculation of realloc() size. Tue Oct 8 14:38:59 1996 Andrew McCallum * libbow.h (bow_array): New structure and suite of functions. (bow_docs): Use it. * Makefile.in (LIBBOW_C_FILES): Added array.c. Renamed doc.c to docs.c. * array.c: New file. * docs.c: New file. * Makefile.in (LIBBOW_C_FILES): Added doc.c. Tue Oct 8 15:27:45 1996 Sean Slattery * libbow.h: Added definitions for heap functions, weight functions and scoring functions.
Also added length field to bow_doc structure. * Makefile.in (LIBBOW_C_FILES): Added score.c, weight.c and heap.c. * heap.c: New file. * weight.c: New file. * score.c: New file. Mon Oct 7 12:14:50 1996 Andrew McCallum * docnames.c (bow_map_filenames_from_dir): New function. (bow_doc_list_append): Use it to do most of the work. * libbow.h: Declare new function. * Makefile.in (snapshot): New target. * getword.c (bow_get_word): Avoid returning a post-stemmed word of length 1. * libbow.h (bow_wv): Renamed member "length" to "num_entries". Added member "length", meaning Euclidean length of the vector. (bow_doc): Added member "class". Removed member "wv". * wv.c: Use new member name "num_entries". * wi2dvf.c: Likewise. * Makefile.in (DIST_FILES): Added HACKING. Sat Oct 5 18:26:47 1996 Andrew McCallum * Version (LIBBOW_VERSION): Version 0.2. * libbow.texi: Cleaned up and added some sections. * dv.c (bow_dv_add_di_count): Fix bugs in calculation of DV_INDEX. In an effort to reduce wasted memory, don't reallocate double the previous SIZE, but 3/2 the previous size; this almost cuts in half the amount of wasted "document vector" memory (perhaps multiplying by 4/3 would help even more?). * wi2dvf.c (bow_wi2dvf_dv): Use new function name bow_dv_new_from_data_fp(). (bow_wi2dvf_print_stats): Fix typo. Also print average number of unused document vector entries. (bow_wi2dvf_new_from_text_dir): Don't use "word vectors". Instead grab each word individually from a text file, and add it to the map using bow_wi2dvf_add_wi_di_count(). (bow_wi2dvf_add_wi_di_count): Newly implemented. * libbow.h (bow_dv_new_from_data_fp): Renamed from bow_dv_new_from_fp. * dv.c (bow_dv_add_di_count): Don't use a new "document entry" if the "document vector" already has an entry for the given DI. * wi2dvf.c (bow_wi2dvf_print_stats): Print stats about number of used and unused "document entries" to get a better idea of memory usage. * rainbow.c (main): Use getopt() to enable setting of bow_verbosity_level.
Wed Oct 2 11:20:58 1996 Andrew McCallum * libbow.h (bow_wi2dvf_add_wi_di_count): New function declaration; not yet implemented. * rainbow.c (main): Don't set bow_verbosity_level to bow_quiet. * docnames.c: Change many FL variable names to DL. (bow_doc_list_append): Don't set *DL to NULL at the beginning, because it won't work recursively. * wi2dvf.c (bow_wi2dvf_new_from_text_dir): Add assertion that verifies length of the document list. * Version 0.0. CVS rtag with `release-0-0'. * wi2dvf.c (bow_wi2dvf_new_from_text_dir): Clean up and count text files and binary files differently. * rainbow.c (main): Comment out setting to bow_quiet. * Makefile.in: Include Version. (version.texi): Fix dependency. (dist): Fix it. * docnames.c (bow_doc_list_append): Don't print extra newline. (bow_doc_list_length): New function. * libbow.h (bow_de): Define di and count as short ints, not ints. (bow_fwrite_short): New function. (bow_fread_short): New function. (bow_doc_list_length): Declare new function. * dv.c (bow_dv_write): Write di and count as short ints. (bow_dv_new_from_fp): Read them as short ints. * io.c (bow_fwrite_short): New function. (bow_fread_short): New function. * Version: New file. * libbow-desc.texi: New file. * Makefile.in (clean): Fix name of libbow.a; also remove the $(DEMO_EXECUTABLES). * rainbow.c (main): Print messages during stages of wi2dvf map testing. Clean up the other test code. * wi2dvf.c (bow_wi2dvf_print_stats): New function. * dv.c (bow_dv_default_capacity): Decreased from 512 to 4 in an attempt to avoid exhausted memory. (bow_dv_count): New global variable. (bow_dv_new): Increment it. (bow_dv_free): Decrement it. * libbow.h (bow_malloc): New function. (bow_realloc): New function. (bow_free): New function. * wv.c: Use bow_malloc() instead of malloc(). * stoplist.c: Likewise. * primes.c: Likewise. * int4str.c: Likewise. * docnames.c: Likewise. * dv.c: Likewise. * wi2dvf.c: Likewise. * Makefile.in (LIBBOW_C_FILES): Added bmalloc.c.
* Placed under CVS with release-tag `first'.