All technical reports files are in compressed (gzip) postscript format.
The file names are of the sort BU-CEIS-YYXX.ps.gz
BU-CEIS-9901
TITLE: Dependency Parsing with an Extended Finite State Approach
AUTHOR: Kemal Oflazer
ABSTRACT: This paper presents a dependency parsing scheme using an
extended finite state approach. The parser augments input
representation with "channels" so that links representing syntactic
dependency relations among words can be accommodated, and
iterates on the input a number of times to arrive at a fixed point.
Intermediate configurations violating various constraints of
projective dependency representations such as no crossing links, no independent
items except sentential head, etc, are filtered via finite state
filters. We have applied the parser to dependency parsing of
Turkish.
BU-CEIS-9902
TITLE: Parallelization of Hierarchical Radiosity Algorithms on
Distributed Memory Computers
AUTHOR: Ahmet Resat Sireli
ABSTRACT: Computing distribution of light in a given environment is an
important problem in computer-aided photo-realistic image
generation. Radiosity method has been proposed to address this problem
which requires an enormous amount of calculation and
memory. Hierarchical radiosity method is a recent approach that
reduces these computational requirements by careful error analysis. It
has its idea from the solution methods of N-body problems. Although
hierarchical approach has greatly reduced the amount of calculations,
satisfactory results still cannot be obtained in terms of processing
time. Exploiting parallelism is a practical way to reduce the
computation time further. In this thesis, we have designed and
implemented a parallel hierarchical radiosity algorithm for
distributed memory computers. Due to its highly irregular
computational structure, hierarchical radiosity algorithms do not
yield easily to parallelization on distributed memory
machines. Dynamically changing computational patterns of the algorithm
cause severe load imbalances. Therefore, we have developed a dynamic
load balancing technique for the parallel hierarchical radiosity
calculation.
BU-CEIS-9903
TITLE: Design for a Turkish Treebank
AUTHORS: Kemal Oflazer, Dilek Zeynep Hakkani-Tür, Gökhan Tür
ABSTRACT: We present the issues that we have encountered in designing a
treebank for Turkish along with rationale for the choices we have made
for various representation schemes. In the resulting
representation, the information encoded in the complex agglutinative
word structures are represented as a sequence of inflectional groups
separated by derivational boundaries. The syntactic relations are
encoded as labeled dependency relations among segments of lexical
items marked by derivation boundaries.
BU-CEIS-9904
TITLE: Processing of Continuous Queries from Moving Objects in Mobile Computing Systems
AUTHORS: Huseyin Gokmen Gok
ABSTRACT: Recent advances in computer hardware technology and wireless
communication networks have led to the emergence of mobile computing
systems. In a mobile computing environment, a user with a wireless
connection to the information network can access data via submitting
queries to the data server. Since the mobility is the most
distinguishing feature of the mobile computing paradigm, location
becomes an important piece of information for the so called
"location-dependent queries" where the answer to a query depends on
the current location of the user who issued the query. A
location-dependent query submitted by a mobile user can become more
difficult to process when it is submitted as a "continuous query" for
which the answer changes as the user moves. The answer to a
location-dependent continuous query is a set that consists of tuples
indicating that object S is the answer of the query
from time begin to time end. Once the tuples in the answer set are
determined, the next step is to determine when to send these tuples to
the user. The transmission time of the tuples is critical in the sense
that it can affect the communication overhead imposed on the wireless
network and the availability of tuples in case of disconnections. In
this thesis, we propose three tuple transmission approaches that
determine the transmission time of a tuple in the answer set of a
location-dependent continuous query. We also design and implement a
simulation model to compare the performance of the proposed tuple
transmission approaches under different settings of environmental
parameters.
BU-CEIS-9905
TITLE: Turing Test: 50 Years Later
AUTHORS: Ayse Pinar Saygin, Ilyas Cicekli, Varol Akman
ABSTRACT: The Turing Test is one of the most disputed topics in
Artificial Intelligence, Philosophy of Mind and Cognitive
Science. This paper is a review of the past 50 years of the Turing
Test. Philosophical debates, practical developments and repercussions
in related disciplines are all covered. We discuss Turing's ideas in
detail and present the important comments that have been made on
them. Within this context, behaviorism, consciousness, the 'other
minds' problem and similar topics in the philosophy of mind are
discussed. We also cover the sociological and psychological aspects of
the Turing Test. Finally, we take a look at the current situation and
analyze the programs that have been developed with the aim of passing
the Turing Test. We conclude that the Turing Test has been, and will
continue to be, a very influential and controversial topic.
BU-CEIS-9906
TITLE: Bankruptcy Prediction Using Feature Projection Based
Classification
AUTHORS: H. Altay Guvenir, Sengor Altingovde, Ilhan Uysal
and Erdal Erel
ABSTRACT:
Bankruptcy prediction has been an important decision-making
process for financial analysts. One of the most common approaches for
the bankruptcy prediction problem is the Discriminant Analysis. Also,
the k-Nearest Neighbor classifier is very successful in such
domains. This paper proposes a Feature Projection based classification
algorithm, and explores its applicability to the problem of predicting
bankruptcy of large firms. The algorithm is evaluated on a particular
data set, and its performance is compared with the techniques
mentioned above. The experiments indicate that the feature projection
based classification algorithm introduced here performs better than
these techniques.
BU-CEIS-9907
TITLE: An Application of Inductive Learning for Mining Basket Data
AUTHORS: Ilhan Uysal and H. Altay Guvenir
ABSTRACT: The development of bar-code technology provided accurate and
large market databases for researchers who deal with datasets. Since the
data is large both in dimension and size* most exploratory analysis
techniques of statistics are not appropriate for such tasks. In this
paper, we describe a high-level algorithm, and the application of it
on a large basket data, extracted from the database of a big
supermarket company. The algorithm have two consecutive steps.
Each step is a different popular machine learning method: clustering
and classification. In this application, we used KMEANS clustering
algorithm and C4.5 classification program respectively. At the end of
the application we come up with a set of items that can be employed
for promotion. By promotion we aim to increase number of costumers
that make their weekly or monthly shopping* which refer to full
baskets among transactions.
BU-CEIS-9908
TITLE: An Efficient Algorithm to Update Large Itemsets with Early Pruning
AUTHORS: Necip Fazil Ayan, Abdullah Uz Tansel and Erol Arkun
ABSTRACT: Although many efficient algorithms have been proposed for
the discovery of association rules, the process of updating large
itemsets is still a complicated issue for dynamic databases that
involve frequent additions. We present an efficient algorithm for
updating large itemsets (UWEP) when new transactions are added to the
set of old transactions in a transaction database. UWEP employs a
dynamic look-ahead strategy in updating the existing large itemsets by
detecting and removing those that will no longer remain large after
the contribution of the new set of transactions. UWEP executes
iteratively, but it differs from the other proposed algorithms by
scanning the existing database at most once and the new database
exactly once. Moreover, it generates and counts the minimum number of
candidates in the new database. The experiments on synthetic data show
that UWEP outperforms the existing algorithms in terms of the
candidates generated and counted.
Keywords: Maintenance of association rules, dynamic pruning, large
itemsets.
BU-CEIS-9909
TITLE: Regression by Feature Projections
AUTHORS: Ilhan Uysal and H. Altay Guvenir
ABSTRACT: This paper describes a machine learning method called,
Regression by Feature Projections (RFP) for predicting a real-valued
target feature. In RFP training is based on simply storing the
projections of the training instances on each feature
separately. Prediction of the target value for a query point is
obtained through two approximation procedures executed
sequentially. The first approximation process is to find the
individual predictions of features by using the K-nearest neighbor
algorithm (KNN). The second approximation process combines the
predictions of all features. During the first approximation step, each
feature is associated with a weight in order to determine the
prediction ability of the feature at the local query point. The
weights, found for each local query point, are used in the second step
and enforce the method to have an adaptive or context-sensitive
nature. We have compared RFP with the KNN algorithm. Results on real
data sets show that RFP is much faster than KNN, yet its prediction
accuracy is comparable with the KNN algorithm
BU-CEIS-9910
TITLE: Application of Machine Learning Techniques to Differential
Diagnosis of Erythemato-Squamous Diseases
AUTHORS: Narin Emeksiz and H. Altay Guvenir
ABSTRACT: This paper is about the implementation of a visual tool for
Differential Diagnosis of Erythemato-Squamous Diseases based on the
classification algorithms; Nearest Neighbor Classifier (NN), Naive
Bayesian Classifier using Normal Distribution (NBC) and Voting Feature
Intervals-5 (VFI5). This tool enables the doctors to differentiate six
types of Erythemato-Squamous Diseases using clinical and
histopathological parameters obtained from a patient. The program also
gives explanations for the classifications of each classifier.
Keywords: Machine Learning, Classification, Dermatology, Feature
Projections.
BU-CEIS-9911
TITLE: Turing Test and Conversation
AUTHORS: Ayse Pinar Saygin
ABSTRACT: The Turing Test is one of the most disputed topics in Artificial
Intelligence, Philosophy of Mind and Cognitive Science. It has been
proposed 50 years ago, as a method to determine whether machines can
think or not. It embodies important philosophical issues, as well as
computational ones. Moreover, because of its characteristics, it
requires interdisciplinary attention. The Turing Test posits that, to
be granted intelligence, a computer should imitate human conversational
behavior so well that it should be indistinguishable from a real human
being. From this, it follows that conversation is a crucial concept in
its study. Surprisingly, focusing on conversation in relation to the
Turing Test has not been a prevailing approach in previous
research. This thesis first provides a thorough and deep review of the
50 years of the Turing Test. Philosophical arguments, computational
concerns, and repercussions in other disciplines are all
discussed. Furthermore, this thesis studies the Turing Test as a
special kind of conversation. In doing so, the relationship between
existing theories of conversation and human-computer communication is
explored. In particular, Grice's cooperative principle and
conversational maxims are concentrated on. Viewing the Turing Test as
conversation and computers as language users have significant effects
on the way we look at Artificial Intelligence, and on communication in
general.
Keywords: Turing Test, Artificial Intelligence,
Conversational maxims, Cooperative Principle, Pragmatics, Natural
Language Conversation Systems, Chatterbots, Conversation Analysis,
Cognitive Science, Philosophy of Language, Computational Linguistics.
BU-CEIS-9912
TITLE: Updating Large Itemsets With Early Pruning
AUTHORS: Necip Fazil Ayan
ABSTRACT: With the computerization of many business and government
transactions, huge amounts of data have been stored in computers. The
existing database systems do not provide the users with the necessary
tools and functionalities to capture all stored information
easily. Therefore, automatic knowledge discovery techniques have been
developed to capture and use the voluminous information hidden in
large databases. Discovery of association rules is an important class
of data mining, which is the process of extracting interesting and
frequent patterns from the data. Association rules aim to capture
the co-occurrences of items, and have wide applicability in many
areas. Discovering association rules is based on the computation of
large itemsets (set of items that occur frequently in the database)
efficiently, and is a computationally expensive operation in large
databases. Thus, maintenance of them in large dynamic databases is an
important issue. In this thesis, we propose an efficient algorithm,
to update large itemsets by considering the set of previously discovered
itemsets. The main idea is to prune an itemset as soon as it is understood
to be small in the updated database, and to keep the set of candidate
large itemsets as small as possible. The proposed algorithm outperforms
the existing update algorithms in terms of the number of scans over the
databases, and the number of candidate large itemsets generated and
counted. Moreover, it can be applied to other data mining tasks that are
based on large itemset framework easily.
Keywords: Data mining, association rules, large itemsets, update of
large itemsets, early pruning.
BU-CEIS-9913
TITLE: A LARGE VOCABULARY SPEECH RECOGNITION SYSTEM FOR TURKISH
AUTHORS: Cemal Yilmaz
ABSTRACT: This thesis presents a large vocabulary isolated word speech recognition
system for Turkish.
The triphones modeled by three-state Hidden Markov
Models (HMM) are used as the smallest unit for the recognition. The
HMM model of a word is
constructed by using the HMM models of the triphones which make up
the word. In the training stage, the word model is trained as a whole
and then each HMM model of the triphones is extracted from the word
model and it is stored individually. In
the recognition stage, HMM models of triphones are used to
construct the HMM models of the words in the dictionary. In this way,
the words that are not trained can be recognized in the recognition stage.
A new dictionary model based on trie structure is introduced for
Turkish with a new search strategy for a given word. This
search strategy performs breadth-first traversal on the trie and
uses the appropriate region of the speech signal at each level of the trie.
Moreover, it is integrated with a pruning strategy to improve both the
system response time and recognition rate.
Keywords: Speech recognition, triphones, Hidden Markov
Model (HMM), trie-based dictionary model, trie-based search strategy.
BU-CEIS-9914
TITLE: EXPLOITING DATA MINING TECHNIQUES FOR BROADCASTING DATA IN
MOBILE COMPUTING ENVIRONMENTS
AUTHORS: Yucel Saygin and Ozgur Ulusoy
ABSTRACT: Mobile computers can be equipped with wireless communication
devices that enable users to access data services from any
location. In wireless communication, the server-to-client (downlink)
communication bandwidth is much higher than the client-to-server
(uplink) communication bandwidth. This asymmetry makes the
dissemination of data to client machines a desirable approach. It is
crucial for mobile clients to use their scarce resources efficiently
while retrieving the required data from the data broadcast by the
server machine. In this paper, we propose some methods for predicting
and pushing the data that might be requested subsequently by mobile
clients. Our methods are based on analyzing the broadcast history
(i.e., the chronological sequence of items that have been requested by
clients) using data mining techniques. We also propose cache
replacement and prefetching techniques for mobile clients based on the
association rules obtained by mining the broadcast history. The
proposed methods are implemented on a web log to estimate their
effectiveness. It is shown through performance experiments that the
proposed rule-based methods are effective in improving the system
performance in terms of the cache hit ratio of mobile clients.
Keywords: Broadcast disks, broadcast histories, mobile databases,
data mining, caching, prefetching.
BU-CEIS-9915
TITLE: HYPERGRAPH MODELS FOR SPARSE MATRIX PARTITIONING AND REORDERING
AUTHOR: Umit V. Catalyurek
ABSTRACT:
Graphs have been widely used to represent sparse matrices for various
scientific applications including one-dimensional (1D) decomposition
of sparse matrices for parallel sparse-matrix vector multiplication
(SpMxV) and sparse matrix reordering for low fill factorization. The
standard graph-partitioning based 1D decomposition of sparse matrices
does not reflect the actual communication volume requirement for
parallel SpMxV. We propose two computational hypergraph models which
avoid this crucial deficiency of the graph model on 1D decomposition.
The proposed models reduce the 1D decomposition problem to the
well-known hypergraph partitioning problem. In the literature, there
is a lack of 2D decomposition heuristic which directly minimizes the
communication requirements for parallel SpMxV computations. Three
novel hypergraph models are introduced for 2D decomposition of sparse
matrices for minimizing the communication volume requirement. The
first hypergraph model is proposed for fine-grain 2D decomposition of
the sparse matrices for parallel SpMxV. The second hypergraph model
for 2D decomposition is proposed to produce jagged-like decomposition
of the sparse matrix. The checkerboard decomposition based parallel
matrix-vector multiplication algorithms are widely encountered in the
literature. However, only the load balancing problem is addressed in
those works. Here, we propose a new hypergraph model which aims the
minimization of communication volume while maintaining the load
balance among the processors for checkerboard decomposition, as the
third model for 2D decomposition. The proposed model reduces the
decomposition problem to the multi-constraint hypergraph partitioning
problem. The notion of multi-constraint partitioning has recently
become popular in graph partitioning. We applied the multi-constraint
partitioning to the hypergraph partitioning problem for solving
checkerboard partitioning. Graph partitioning by vertex separator
(GPVS) is widely used for nested dissection based low fill ordering of
sparse matrices for direct solution of linear systems. In this work,
we also show that the GPVS problem can formulated as hypergraph
partitioning. We exploit this finding to develop a novel hypergraph
partitioning-based nested dissection ordering. The recently proposed
successful multilevel framework is exploited to develop a multilevel
hypergraph partitioning tool PaToH for the experimental verification
of our proposed hypergraph models. Experimental results on a wide
range of realistic sparse test matrices confirm the validity of the
proposed hypergraph models. In terms of communication volume, the
proposed hypergraph models produce 30% and 59% better decompositions
than the graph model in 1D and 2D decompositions of sparse matricies
for parallel SpMxV computations, respectively. The proposed
hypergraph partitioning-based nested dissection produces 25% to 45%
better orderings than the widely used multiple mimimum degree ordering
in the ordering of various test matrices arising from different
applications.
Keywords:
Sparse matrices, parallel matrix-vector multiplication, parallel
processing, matrix decomposition, computational graph model, graph
partitioning, computational hypergraph model, hypergraph partitioning,
ll reducing ordering, nested dissection.