Technical Reports Published in 1998
Implementing Voting Constraints with Finite State TransducersWe describe a constraint-based morphological disambiguation system in which individual constraint rules vote on matching morphological parses followed by its implementation using finite state transducers. Voting constraint rules have a number of desirable properties: The outcome of the disambiguation is independent of the order of application of the local contextual constraint rules. Thus the rule developer is relieved from worrying about conflicting rule sequencing. The approach can also combine statistically and manually obtained constraints, and incorporate negative constraints that rule out certain patterns. The transducer implementation has a number of desirable properties compared to other finite state tagging and light parsing approaches, implemented with automata intersection. The most important of these is that since constraints do not remove parses there is no risk of an overzealous constraint ``killing a sentence'' by removing all parses of a token during intersection. After a description of our approach we present preliminary results from tagging the Wall Street Journal Corpus with this approach. With about 400 statistically derived constraints and about 570 manual constraints, we can attain an accuracy of 97.82% on the training corpus and 97.29% on the test corpus. We then describe a finite state implementation of our approach and discuss various related issues.
A Supervised Machine Learning Algorithm for Arrhythmia AnalysisA new machine learning algorithm for the diagnosis of cardiac arrhythmia from standard 12 lead ECG recordings is presented. The algorithm is called VFI5 for Voting Feature Intervals. VFI5 is a supervised and inductive learning algorithm for inducing classification knowledge from examples. The input to VFI5 is a training set of records. Each record contains clinical measurements, from ECG signals and some other information such as sex, age, and weight, along with the decision of an expert cardiologist. The knowledge representation is based on a recent technique called Feature Intervals, where a concept is represented by the projections of the training cases on each feature separately. Classification in VFI5 is based on a majority voting among the class predictions made by each feature separately. The comparison of the VFI5 algorithm indicates that it outperforms other standard algorithms such as Naive Bayesian and Nearest Neighbor classifiers.
Introduction to the Fractional Fourier Transform and Its ApplicationsPlease note the numbering with the report above.
Tagging English by Path Voting ConstraintsWe describe a constraint-based tagging approach where individual constraint rules vote on sequences of matching tokens and tags. Disambiguation of all tokens in a sentence is performed at the very end by selecting tags that appear on the path that receives highest vote. This constraint application paradigm makes the outcome of the disambiguation independent of the rule sequence, and hence relieves the rule developer from worrying about potentially conflicting rule sequencing found in other systems. The approach can also combine statistically and manually obtained constraints, and incorporate negative constraint rules that rule out certain patterns. We have applied this approach to tagging English text from the Wall Street Journal and the Brown Corpora. Our results from the Wall Street Journal Corpus indicate that with 400 statistically derived constraint rules and about 657 hand-crafted constraint rules, we can attain an average accuracy of 97.56% on the training corpus and an average accuracy of 97.12 on the testing corpus with 11-fold cross-validation. We can also relax the single tag per token limitation and allow ambiguous tagging which lets us trade recall and precision.
NOTE: This report is an extensively revised version of BU-CEIS-9704.
A Finite-State Kernel Architecture for Turkish Natural Language ProcessingWe present a finite-state kernel architecture for Turkish that performs certain presyntactic processing steps on a given sentence, such as morphological analysis, and recognition of lexicalized and nonlexicalized collocations, followed by morphological disambiguation by voting constraints. The kernel has been implemented using the Xerox Finite State Tools. The approach to recognizing collocations presented here is of particular interest for languages with highly agglutinative morphology (of which Turkish is a very good example), since it only requires a single mechanism to deal with a potentially infinite number of variants of a single collocation in certain cases. Moreover, it may also help resolve morphological ambiguity to some degree.
Comparison of Partitioning Techniques for Two-Level Iterative Solvers on Large, Sparse Markov ChainsExperimental results for large, sparse Markov chains, especially the ill-conditioned nearly completely decomposable (NCD) ones, are few. We believe there is need for further research in this area, specifically to help in understanding the effects of the degree of coupling of NCD Markov chains and their nonzero structure on the convergence characteristics and space requirements of iterative solvers. The work of several researchers has raised the following questions that led to research in a related direction. How one must go about partitioning the global coefficient matrix into blocks when the system is NCD and a two-level iterative solver (such as block SOR) is to be employed? Are block partitionings dictated by the NCD normal form of the stochastic one-step transition probability matrix necessarily superior to others? Is it worth investing alternative partitionings? Better yet, for a fixed labeling and partitioning of the states, how does the performance of block SOR (or even that of point SOR) compare to the performance of the iterative aggregation-disaggregation (IAD) algorithm? Finally, is there any merit in using two-level iterative solvers when preconditioned Krylov subspace methods are available? We seek answers to these questions on a test suite of thirteen Markov chains arising in seven applications.
An HPSG Account of Relativization in Turkish Using Relational ConstraintsIn this paper we propose an HPSG account of the complex phenomenon of relativization in Turkish. Relative clauses in this language are prenominal and headed by a participle whose form depends on the existence of a genitive subject in the clause in the case of bounded dependencies and on two further factors, namely the grammatical function of the gap host and the gap, in the case of long-distance dependencies. Previous accounts have all been transformational. We present an account for this phenomenon in HPSG, making use of lexically specified
MODvalues, valency lists and non-local feature handling. We outline a comprehensive analysis that exploits relational constraints on lexical entries.
Left-to-Right Constraint-based Parsing of Head-final LanguagesWe present a parsing algorithm for HPSG grammars that are specified in the form of a type hierarchy, with the constraints governing certain types imposed on the respective types in the hierarchy. The algorithm works directly on the representations provided by the HPSG formalism, making essential use of the selection features governed by the formalism. Parsing a string starts with an underspecified structure (assigned to that string), and proceeds by attaching every word of the input, from left to right, to that global structure, thereby dynamically changing it as the parse progresses. We propose certain strategies for a head-final language that guarantee the correct parse/parses with the least possible number of processing steps in most cases (and with minimal reanalysis in the remaining ones), which makes the algorithm particularly interesting for such languages.
Permuting Markov Chains to Nearly Completely Decomposable FormThis paper highlights an algorithm that computes, if possible, a nearly completely decomposable (NCD) partitioning for a given Markov chain using a specified decomposability parameter. The algorithm is motivated by search for connected components (CCs) of an undirected graph. The nestedness of the NCD partitionings for increasing values of the decomposability parameter is demonstrated on the Courtois matrix. The relation among the degree of coupling, the smallest eigenvalue of the coupling matrix and the magnitude of the subdominant eigenvalue of the block Gauss-Seidel (BGS) iteration matrix induced by the underlying NCD partitionings is investigated on the same matrix. Experimental results that appear elsewhere show that the partitioning algorithm may be used successfully in two-level iterative solvers such as block successive over-relaxation (BSOR) and iterative aggregation-disaggregation (IAD).
Application of k-Nearest Neighbor of Feature Projections Classifier to Text CategorizationThis paper presents the results of the application of an instance-based learning algorithm k-Nearest Neighbor Method on Feature Projections (k-NNFP) to text categorization and compares it with k-Nearest Neighbor Classifier (k-NN). k-NNFP is similar to k-NN except it finds the nearest neighbors according to each feature separately. Then it combines these predictions using a majority voting. This property causes k-NNFP to eliminate possible adverse effects of irrelevant features on the classification accuracy. Experimental evidence indicates that k-NNFP is superior to k-NN in terms of classification accuracy in the presence of irrelevant features in many real world domains.
A Classification Learning Algorithm Robust to Irrelevant FeaturesPresence of irrelevant features is a fact of life in many real-world applications of classification learning. Although nearest-neighbor classification algorithms have emerged as a promising approach to machine learning tasks with their high predictive accuracy, they are adversely affected by the presence of such irrelevant features. In this paper, we describe a recently proposed classification algorithm called VFI5, which achieves comparable accuracy to nearest-neighbor classifiers while it is robust with respect to irrelevant features. The paper compares both the nearest-neighbor classifier and the VFI5 algorithms in the presence of irrelevant features on both artificially generated and real-world data sets selected from the UCI repository.
Multilevel Heuristics for Task Assignment in Distributed Systems (M.Sc. Thesis)Task assignment problem deals with assigning tasks to processors in order to minimize the sum of execution and communication costs in a distributed system. In this work, we propose a novel task clustering scheme which considers the differences between the execution times of tasks to be clustered as well as the communication costs between them. We use this clustering approach with proper assignment schemes to implement two-phase assignment algorithms which can be used to find suboptimal solutions to any task assignment problem. In addition, we adapt the multilevel scheme used in graph/hypergraph partitioning to the task assignment. Multilevel assignment algorithms reduce the size of the original problem by collapsing tasks, find an initial assignment on the smaller problem, and then projects it towards the original problem by successively refining the assignment at each level. We propose several clustering schemes for multilevel assignment algorithms. The performance of all proposed algorithms are evaluated through an experim ental study where the assignment qualities are compared with two up-to-date heuristics. Experimental results show that our algorithms substantially outperform both of the existing heuristics.
An Information-Based Approach to Punctuation (Ph. D. Thesis)Punctuation marks have special importance in bringing out the meaning of a text. Geoffrey Nunberg's 1990 monograph bridged the gap between descriptive treatments of punctuation and prescriptive accounts, by spelling out the features of a text-grammar for the orthographic sentence. His research inspired most of the recent work concentrating on punctuation marks in Natural Language Processing (NLP). Several grammars incorporating punctuation were then shown to reduce failures and ambiguities in parsing. Nunberg's approach to punctuation (and other formatting devices) was partially incorporated into natural language generation systems. However, little has been done concerning how punctuation marks bring semantic and discourse cues to the text and whether these can be exploited computationally.
The aim of this thesis is to analyse the semantic and discourse aspects of punctuation marks, within the framework of Hans Kamp and Uwe Reyle's Discourse Representation Theory (DRT) (and its extension by Nicholas Asher, Segmented Discourse Representation Theory (SDRT)), drawing implications for NLP systems. The method used is the extraction of patterns for four common punctuation marks (dashes, semicolons, colons, and parentheses) from corpora, followed by formal modeling and a modest computational prototype. Our observations and results have revealed interesting occurrences of linguistic phenomena, such as anaphora resolution and presupposition, in conjunction with punctuation marks. Within the framework of SDRT such occurrences are then tied with the overall discourse structure. The proposed model can be taken as a template for NLP software developers for making use of the punctuation marks more effectively. Overall, the thesis describes the contribution of punctuation at the orthographic sentence level to the information passed on to the reader of a text.