All technical reports files are in compressed (gzip) postscript format.
The file names are of the sort BU-CEIS-YYXX.ps.gz or BU-CEIS-YYXX.pdf.
BU-CEIS-9801
TITLE: Implementing Voting Constraints with Finite State Transducers
AUTHORS: Kemal Oflazer and Gökhan Tür
ABSTRACT: We describe a constraint-based morphological disambiguation
system in which individual constraint rules vote on matching
morphological parses followed by its implementation using finite state
transducers. Voting constraint rules have a number of desirable
properties: The outcome of the disambiguation is independent of the
order of application of the local contextual constraint rules. Thus
the rule developer is relieved from worrying about conflicting rule
sequencing. The approach can also combine statistically and manually
obtained constraints, and incorporate negative constraints that rule
out certain patterns. The transducer implementation has a number of
desirable properties compared to other finite state tagging and light
parsing approaches, implemented with automata intersection. The most
important of these is that since constraints do not remove parses
there is no risk of an overzealous constraint ``killing a sentence''
by removing all parses of a token during intersection. After a
description of our approach we present preliminary results from
tagging the Wall Street Journal Corpus with this approach. With about
400 statistically derived constraints and about 570 manual
constraints, we can attain an accuracy of 97.82% on the training
corpus and 97.29% on the test corpus. We then describe a finite
state implementation of our approach and discuss various related
issues.
BU-CEIS-9802
TITLE: A Supervised Machine Learning Algorithm for Arrhythmia Analysis
AUTHORS: H. A. Güvenir, B. Acar, G. Demiröz, A. Cekin,
ABSTRACT:A new machine learning algorithm for the diagnosis
of cardiac arrhythmia from standard 12 lead ECG recordings is
presented. The algorithm is called VFI5 for Voting Feature Intervals.
VFI5 is a supervised and inductive learning algorithm for inducing
classification knowledge from examples. The input to VFI5 is a
training set of records. Each record contains clinical measurements,
from ECG signals and some other information such as sex, age, and
weight, along with the decision of an expert cardiologist. The
knowledge representation is based on a recent technique called Feature
Intervals, where a concept is represented by the projections of the
training cases on each feature separately. Classification in VFI5 is
based on a majority voting among the class predictions made by each
feature separately. The comparison of the VFI5 algorithm indicates
that it outperforms other standard algorithms such as Naive Bayesian
and Nearest Neighbor classifiers.
BU-CEIS-9802
TITLE: Introduction to the fractional Fourier transform and its applications
AUTHORS: Haldun M. Ozaktas, M. Alper Kutay, and David Mendlovic
Please note duble numbering with the report above.
BU-CEIS-9803
TITLE: Tagging English by Path Voting Constraints
(A much revised version of BU-CEIS-9704)
AUTHORS: Gökhan Tür and Kemal Oflazer
ABSTRACT: We describe a constraint-based tagging approach where
individual constraint rules vote on sequences of matching tokens and
tags. Disambiguation of all tokens in a sentence is performed at
the very end by selecting tags that appear on the path that receives
highest vote. This constraint application paradigm makes the
outcome of the disambiguation independent of the rule
sequence, and hence relieves the rule developer from worrying about
potentially conflicting rule sequencing found in other systems. The
approach can also combine statistically and manually obtained
constraints, and incorporate negative constraint rules that rule out
certain patterns. We have applied this approach to tagging English
text from the Wall Street Journal and the Brown Corpora. Our results
from the Wall Street Journal Corpus indicate that with 400
statistically derived constraint rules and about 657 hand-crafted
constraint rules, we can attain an average accuracy of 97.56% on
the training corpus and an average accuracy of 97.12 on
the testing corpus with 11-fold cross-validation. We can also relax
the single tag per token limitation and allow ambiguous tagging which
lets us trade recall and precision.
BU-CEIS-9804
TITLE: A Finite-State Kernel Architecture for Turkish Natural
Language Processing
AUTHORS: Zelal Güngördü and Kemal Oflazer
ABSTRACT: We present a finite-state kernel architecture for Turkish that
performs certain presyntactic processing steps on a given
sentence, such as morphological analysis, and recognition of
lexicalized and nonlexicalized collocations, followed by
morphological disambiguation by voting constraints. The kernel has been implemented using the Xerox Finite State Tools. The approach to recognizing
collocations presented here is of particular interest for
languages with highly agglutinative morphology (of which Turkish is a
very good example), since it only requires a single mechanism to deal
with a potentially infinite number of variants of a single
collocation in certain cases. Moreover, it may also help resolve
morphological ambiguity to some degree.
BU-CEIS-9805
TITLE: Comparison of Partitioning Techniques for Two-Level Iterative
Solvers on Large, Sparse Markov Chains
AUTHORS: Tugrul Dayar and William J. Stewart
ABSTRACT: Experimental results for large, sparse Markov chains, especially
the ill-conditioned nearly completely decomposable (NCD) ones, are few.
We believe there is need for further research in this area, specifically to
help in understanding the effects of the degree of coupling of NCD Markov
chains and their nonzero structure on the convergence characteristics and
space requirements of iterative solvers. The work of several researchers
has raised the following questions that led to research in a related
direction. How one must go about partitioning the global coefficient matrix
into blocks when the system is NCD and a two-level iterative solver (such as
block SOR) is to be employed? Are block partitionings dictated by the NCD
normal form of the stochastic one-step transition probability matrix
necessarily superior to others? Is it worth investing alternative
partitionings? Better yet, for a fixed labeling and partitioning of the
states, how does the performance of block SOR (or even that of point SOR)
compare to the performance of the iterative aggregation-disaggregation (IAD)
algorithm? Finally, is there any merit in using two-level iterative solvers
when preconditioned Krylov subspace methods are available? We seek answers
to these questions on a test suite of thirteen Markov chains arising in
seven applications.
BU-CEIS-9806
TITLE: An HPSG Account of Relativization in Turkish Using Relational Constraints
AUTHORS: Zelal Güngördü and Elisabet Engdahl (University of Gothenburg, Sweden)
ABSTRACT: In this paper we propose an HPSG account of the complex
phenomenon of relativization in Turkish. Relative clauses in this language are
prenominal and headed by a participle whose form depends on the
existence of a genitive subject in the clause in the case of bounded
dependencies and on two further factors, namely the grammatical function
of the gap host and the gap, in the case of long-distance
dependencies. Previous accounts have all been transformational. We
present an account for this phenomenon in HPSG, making use of
lexically specified MOD values, valency lists and
non-local feature handling. We outline a comprehensive analysis that
exploits relational constraints on lexical entries.
BU-CEIS-9807
TITLE: Left-to-Right Constraint-based Parsing
of Head-final Languages
AUTHORS: Zelal Güngördü
ABSTRACT: We present a parsing algorithm for HPSG grammars that are
specified in the form of a type hierarchy, with the constraints
governing certain types imposed on the respective types in the
hierarchy. The algorithm works directly on the representations
provided by the HPSG formalism, making essential use of the
selection features governed by the formalism. Parsing a string starts
with an underspecified structure (assigned to that string),
and proceeds by attaching every word of the input, from left to
right, to that global structure, thereby dynamically changing
it as the parse progresses. We propose certain strategies for a
head-final language that guarantee the correct
parse/parses with the least possible number of processing steps
in most cases (and with minimal reanalysis in the remaining ones),
which makes the algorithm particularly interesting for such
languages.
BU-CEIS-9808
TITLE: Permuting Markov Chains to Nearly Completely Decomposable Form
AUTHORS: Tugrul Dayar
ABSTRACT: This paper highlights an algorithm that computes, if possible, a
nearly completely decomposable (NCD) partitioning for a given Markov chain
using a specified decomposability parameter. The algorithm is motivated by
search for connected components (CCs) of an undirected graph. The nestedness
of the NCD partitionings for increasing values of the decomposability
parameter is demonstrated on the Courtois matrix. The relation among the
degree of coupling, the smallest eigenvalue of the coupling matrix and the
magnitude of the subdominant eigenvalue of the block Gauss-Seidel (BGS)
iteration matrix induced by the underlying NCD partitionings is investigated
on the same matrix. Experimental results that appear elsewhere show that the
partitioning algorithm may be used successfully in two-level iterative
solvers such as block successive over-relaxation (BSOR) and iterative
aggregation-disaggregation (IAD).
BU-CEIS-9809
TITLE: Application of k-Nearest Neighbor of Feature Projections
Classifier to Text Categorization
AUTHORS: T. Yavuz and H. A. Güvenir
ABSTRACT: This paper presents the results of the application of an
instance-based learning algorithm k-Nearest Neighbor Method on
Feature Projections (k-NNFP) to text categorization and
compares it with k-Nearest Neighbor Classifier (k-NN). k-NNFP is similar to k-NN except it
finds the nearest neighbors according to each feature separately. Then it
combines these predictions using a majority voting. This property causes
k-NNFP to eliminate possible adverse effects of irrelevant
features on the classification accuracy. Experimental evidence
indicates that k-NNFP is superior to k-NN in terms of
classification accuracy in the presence of irrelevant features in many
real world domains.
BU-CEIS-9810
TITLE: A Classification Learning Algorithm Robust to Irrelevant Features
AUTHORS: H. A. Güvenir
ABSTRACT: Presence of irrelevant features is a fact of life in many
real-world applications of classification learning.
Although nearest-neighbor classification algorithms have emerged
as a promising approach to machine learning tasks with their high
predictive accuracy, they are adversely affected by the presence of such
irrelevant features.
In this paper, we describe a recently proposed classification
algorithm called VFI5, which achieves comparable accuracy to
nearest-neighbor classifiers while it is robust with respect to
irrelevant features.
The paper compares both the nearest-neighbor classifier and the VFI5
algorithms in the presence of irrelevant features on both artificially
generated and real-world data sets selected from the UCI repository.
BU-CEIS-9811
TITLE: Multilevel Heuristics for Task Assignment in Distributed Systems
(M.Sc. Thesis)
AUTHORS: Murat Ikinci
ABSTRACT: Task assignment problem deals with assigning tasks to
processors in order to minimize the sum of execution and communication
costs in a distributed system. In this work, we propose a novel task
clustering scheme which considers the differences between the
execution times of tasks to be clustered as well as the communication
costs between them. We use this clustering approach with proper
assignment schemes to implement two-phase assignment algorithms which
can be used to find suboptimal solutions to any task assignment problem.
In addition, we adapt the multilevel scheme used in graph/hypergraph
partitioning to the task assignment. Multilevel assignment algorithms
reduce the size of the original problem by collapsing tasks, find an
initial assignment on the smaller problem, and then projects it
towards the original problem by successively refining the assignment
at each level. We propose several clustering schemes for multilevel
assignment algorithms. The performance of all proposed algorithms are
evaluated through an experim ental study where the assignment
qualities are compared with two up-to-date heuristics. Experimental
results show that our algorithms substantially outperform both of the
existing heuristics.
BU-CEIS-9812
TITLE: An Information-Based Approach to Punctuation
(Ph.D. Thesis)
AUTHORS: Bilge Say
ABSTRACT: Punctuation marks have special importance in bringing out
the meaning of a text. Geoffrey Nunberg's 1990 monograph bridged the
gap between descriptive treatments of punctuation and prescriptive
accounts, by spelling out the features of a text-grammar for the
orthographic sentence. His research inspired most of the recent work
concentrating on punctuation marks in Natural Language Processing
(NLP). Several grammars incorporating punctuation were then shown to
reduce failures and ambiguities in parsing. Nunberg's approach to
punctuation (and other formatting devices) was partially incorporated
into natural language generation systems. However, little has been
done concerning how punctuation marks bring semantic and discourse
cues to the text and whether these can be exploited computationally.
The aim of this thesis is to analyse the semantic and discourse
aspects of punctuation marks, within the framework of Hans Kamp and
Uwe Reyle's Discourse Representation Theory (DRT) (and its extension
by Nicholas Asher, Segmented Discourse Representation Theory (SDRT)),
drawing implications for NLP systems. The method used is the
extraction of patterns for four common punctuation marks (dashes,
semicolons, colons, and parentheses) from corpora, followed by formal
modeling and a modest computational prototype. Our observations and
results have revealed interesting occurrences of linguistic phenomena,
such as anaphora resolution and presupposition, in conjunction with
punctuation marks. Within the framework of SDRT such occurrences are
then tied with the overall discourse structure. The proposed model
can be taken as a template for NLP software developers for making
use of the punctuation marks more effectively. Overall, the thesis
describes the contribution of punctuation at the orthographic
sentence level to the information passed on to the reader of a text.