Technical Reports Published in 1999


Dependency Parsing with an Extended Finite State Approach

Kemal Oflazer

This paper presents a dependency parsing scheme using an extended finite state approach. The parser augments input representation with "channels" so that links representing syntactic dependency relations among words can be accommodated, and iterates on the input a number of times to arrive at a fixed point. Intermediate configurations violating various constraints of projective dependency representations such as no crossing links, no independent items except sentential head, etc, are filtered via finite state filters. We have applied the parser to dependency parsing of Turkish.


Parallelization of Hierarchical Radiosity Algorithms on Distributed Memory Computers (M. Sc. Thesis)

Ahmet Reşat Sireli

Computing distribution of light in a given environment is an important problem in computer-aided photo-realistic image generation. Radiosity method has been proposed to address this problem which requires an enormous amount of calculation and memory. Hierarchical radiosity method is a recent approach that reduces these computational requirements by careful error analysis. It has its idea from the solution methods of N-body problems. Although hierarchical approach has greatly reduced the amount of calculations, satisfactory results still cannot be obtained in terms of processing time. Exploiting parallelism is a practical way to reduce the computation time further. In this thesis, we have designed and implemented a parallel hierarchical radiosity algorithm for distributed memory computers. Due to its highly irregular computational structure, hierarchical radiosity algorithms do not yield easily to parallelization on distributed memory machines. Dynamically changing computational patterns of the algorithm cause severe load imbalances. Therefore, we have developed a dynamic load balancing technique for the parallel hierarchical radiosity calculation.


Design for a Turkish Treebank

Kemal Oflazer, Dilek Zeynep Hakkani-Tür, Gökhan Tür

We present the issues that we have encountered in designing a treebank for Turkish along with rationale for the choices we have made for various representation schemes. In the resulting representation, the information encoded in the complex agglutinative word structures are represented as a sequence of inflectional groups separated by derivational boundaries. The syntactic relations are encoded as labeled dependency relations among segments of lexical items marked by derivation boundaries.


Processing of Continuous Queries from Moving Objects in Mobile Computing Systems (M. Sc. Thesis)

Hüseyin Gökmen Gök

Recent advances in computer hardware technology and wireless communication networks have led to the emergence of mobile computing systems. In a mobile computing environment, a user with a wireless connection to the information network can access data via submitting queries to the data server. Since the mobility is the most distinguishing feature of the mobile computing paradigm, location becomes an important piece of information for the so called "location-dependent queries" where the answer to a query depends on the current location of the user who issued the query. A location-dependent query submitted by a mobile user can become more difficult to process when it is submitted as a "continuous query" for which the answer changes as the user moves. The answer to a location-dependent continuous query is a set that consists of tuples indicating that object S is the answer of the query from time begin to time end. Once the tuples in the answer set are determined, the next step is to determine when to send these tuples to the user. The transmission time of the tuples is critical in the sense that it can affect the communication overhead imposed on the wireless network and the availability of tuples in case of disconnections. In this thesis, we propose three tuple transmission approaches that determine the transmission time of a tuple in the answer set of a location-dependent continuous query. We also design and implement a simulation model to compare the performance of the proposed tuple transmission approaches under different settings of environmental parameters.


Turing Test: 50 Years Later

Ayşe Pınar Saygın, İlyas Çiçekli, Varol Akman

The Turing Test is one of the most disputed topics in Artificial Intelligence, Philosophy of Mind and Cognitive Science. This paper is a review of the past 50 years of the Turing Test. Philosophical debates, practical developments and repercussions in related disciplines are all covered. We discuss Turing's ideas in detail and present the important comments that have been made on them. Within this context, behaviorism, consciousness, the 'other minds' problem and similar topics in the philosophy of mind are discussed. We also cover the sociological and psychological aspects of the Turing Test. Finally, we take a look at the current situation and analyze the programs that have been developed with the aim of passing the Turing Test. We conclude that the Turing Test has been, and will continue to be, a very influential and controversial topic.


Bankruptcy Prediction Using Feature Projection Based Classification

H. Altay Güvenir, Sengör Altıngövde, İlhan Uysal and Erdal Erel

Bankruptcy prediction has been an important decision-making process for financial analysts. One of the most common approaches for the bankruptcy prediction problem is the Discriminant Analysis. Also, the k-Nearest Neighbor classifier is very successful in such domains. This paper proposes a Feature Projection based classification algorithm, and explores its applicability to the problem of predicting bankruptcy of large firms. The algorithm is evaluated on a particular data set, and its performance is compared with the techniques mentioned above. The experiments indicate that the feature projection based classification algorithm introduced here performs better than these techniques.


An Application of Inductive Learning for Mining Basket Data

İlhan Uysal and H. Altay Güvenir

The development of bar-code technology provided accurate and large market databases for researchers who deal with datasets. Since the data is large both in dimension and size* most exploratory analysis techniques of statistics are not appropriate for such tasks. In this paper, we describe a high-level algorithm, and the application of it on a large basket data, extracted from the database of a big supermarket company. The algorithm have two consecutive steps. Each step is a different popular machine learning method: clustering and classification. In this application, we used KMEANS clustering algorithm and C4.5 classification program respectively. At the end of the application we come up with a set of items that can be employed for promotion. By promotion we aim to increase number of costumers that make their weekly or monthly shopping* which refer to full baskets among transactions.


An Efficient Algorithm to Update Large Itemsets with Early Pruning

Necip Fazıl Ayan, Abdullah Uz Tansel and Erol Arkun

Although many efficient algorithms have been proposed for the discovery of association rules, the process of updating large itemsets is still a complicated issue for dynamic databases that involve frequent additions. We present an efficient algorithm for updating large itemsets (UWEP) when new transactions are added to the set of old transactions in a transaction database. UWEP employs a dynamic look-ahead strategy in updating the existing large itemsets by detecting and removing those that will no longer remain large after the contribution of the new set of transactions. UWEP executes iteratively, but it differs from the other proposed algorithms by scanning the existing database at most once and the new database exactly once. Moreover, it generates and counts the minimum number of candidates in the new database. The experiments on synthetic data show that UWEP outperforms the existing algorithms in terms of the candidates generated and counted.

Keywords: Maintenance of association rules, dynamic pruning, large itemsets.


Regression by Feature Projections

İlhan Uysal and H. Altay Güvenir

This paper describes a machine learning method called, Regression by Feature Projections (RFP) for predicting a real-valued target feature. In RFP training is based on simply storing the projections of the training instances on each feature separately. Prediction of the target value for a query point is obtained through two approximation procedures executed sequentially. The first approximation process is to find the individual predictions of features by using the K-nearest neighbor algorithm (KNN). The second approximation process combines the predictions of all features. During the first approximation step, each feature is associated with a weight in order to determine the prediction ability of the feature at the local query point. The weights, found for each local query point, are used in the second step and enforce the method to have an adaptive or context-sensitive nature. We have compared RFP with the KNN algorithm. Results on real data sets show that RFP is much faster than KNN, yet its prediction accuracy is comparable with the KNN algorithm


Application of Machine Learning Techniques to Differential Diagnosis of Erythemato-Squamous Diseases

Narin Emeksiz and H. Altay Güvenir

This paper is about the implementation of a visual tool for Differential Diagnosis of Erythemato-Squamous Diseases based on the classification algorithms; Nearest Neighbor Classifier (NN), Naive Bayesian Classifier using Normal Distribution (NBC) and Voting Feature Intervals-5 (VFI5). This tool enables the doctors to differentiate six types of Erythemato-Squamous Diseases using clinical and histopathological parameters obtained from a patient. The program also gives explanations for the classifications of each classifier.

Keywords: Machine Learning, Classification, Dermatology, Feature Projections.


Turing Test and Conversation

Ayşe Pınar Saygın

The Turing Test is one of the most disputed topics in Artificial Intelligence, Philosophy of Mind and Cognitive Science. It has been proposed 50 years ago, as a method to determine whether machines can think or not. It embodies important philosophical issues, as well as computational ones. Moreover, because of its characteristics, it requires interdisciplinary attention. The Turing Test posits that, to be granted intelligence, a computer should imitate human conversational behavior so well that it should be indistinguishable from a real human being. From this, it follows that conversation is a crucial concept in its study. Surprisingly, focusing on conversation in relation to the Turing Test has not been a prevailing approach in previous research. This thesis first provides a thorough and deep review of the 50 years of the Turing Test. Philosophical arguments, computational concerns, and repercussions in other disciplines are all discussed. Furthermore, this thesis studies the Turing Test as a special kind of conversation. In doing so, the relationship between existing theories of conversation and human-computer communication is explored. In particular, Grice's cooperative principle and conversational maxims are concentrated on. Viewing the Turing Test as conversation and computers as language users have significant effects on the way we look at Artificial Intelligence, and on communication in general.

Keywords: Turing Test, Artificial Intelligence, Conversational maxims, Cooperative Principle, Pragmatics, Natural Language Conversation Systems, Chatterbots, Conversation Analysis, Cognitive Science, Philosophy of Language, Computational Linguistics.


Updating Large Itemsets With Early Pruning (M. Sc. Thesis)

Necip Fazıl Ayan

With the computerization of many business and government transactions, huge amounts of data have been stored in computers. The existing database systems do not provide the users with the necessary tools and functionalities to capture all stored information easily. Therefore, automatic knowledge discovery techniques have been developed to capture and use the voluminous information hidden in large databases. Discovery of association rules is an important class of data mining, which is the process of extracting interesting and frequent patterns from the data. Association rules aim to capture the co-occurrences of items, and have wide applicability in many areas. Discovering association rules is based on the computation of large itemsets (set of items that occur frequently in the database) efficiently, and is a computationally expensive operation in large databases. Thus, maintenance of them in large dynamic databases is an important issue. In this thesis, we propose an efficient algorithm, to update large itemsets by considering the set of previously discovered itemsets. The main idea is to prune an itemset as soon as it is understood to be small in the updated database, and to keep the set of candidate large itemsets as small as possible. The proposed algorithm outperforms the existing update algorithms in terms of the number of scans over the databases, and the number of candidate large itemsets generated and counted. Moreover, it can be applied to other data mining tasks that are based on large itemset framework easily.

Keywords: Data mining, association rules, large itemsets, update of large itemsets, early pruning.


A Large Vocabulary Speech Recognition System for Turkish (M. Sc. Thesis)

Cemal Yılmaz

This thesis presents a large vocabulary isolated word speech recognition system for Turkish. The triphones modeled by three-state Hidden Markov Models (HMM) are used as the smallest unit for the recognition. The HMM model of a word is constructed by using the HMM models of the triphones which make up the word. In the training stage, the word model is trained as a whole and then each HMM model of the triphones is extracted from the word model and it is stored individually. In the recognition stage, HMM models of triphones are used to construct the HMM models of the words in the dictionary. In this way, the words that are not trained can be recognized in the recognition stage. A new dictionary model based on trie structure is introduced for Turkish with a new search strategy for a given word. This search strategy performs breadth-first traversal on the trie and uses the appropriate region of the speech signal at each level of the trie. Moreover, it is integrated with a pruning strategy to improve both the system response time and recognition rate.

Keywords: Speech recognition, triphones, Hidden Markov Model (HMM), trie-based dictionary model, trie-based search strategy.


Exploiting Data Mining Techniques for Broadcasting Data in Mobile Computing Environments

Yücel Saygın and Özgür Ulusoy

Mobile computers can be equipped with wireless communication devices that enable users to access data services from any location. In wireless communication, the server-to-client (downlink) communication bandwidth is much higher than the client-to-server (uplink) communication bandwidth. This asymmetry makes the dissemination of data to client machines a desirable approach. It is crucial for mobile clients to use their scarce resources efficiently while retrieving the required data from the data broadcast by the server machine. In this paper, we propose some methods for predicting and pushing the data that might be requested subsequently by mobile clients. Our methods are based on analyzing the broadcast history (i.e., the chronological sequence of items that have been requested by clients) using data mining techniques. We also propose cache replacement and prefetching techniques for mobile clients based on the association rules obtained by mining the broadcast history. The proposed methods are implemented on a web log to estimate their effectiveness. It is shown through performance experiments that the proposed rule-based methods are effective in improving the system performance in terms of the cache hit ratio of mobile clients.

Keywords: Broadcast disks, broadcast histories, mobile databases, data mining, caching, prefetching.


Hypergraph Models for Sparse Matrix Partitioning and Reordering (Ph. D. Thesis)

Ümit V. Çatalyürek

Graphs have been widely used to represent sparse matrices for various scientific applications including one-dimensional (1D) decomposition of sparse matrices for parallel sparse-matrix vector multiplication (SpMxV) and sparse matrix reordering for low fill factorization. The standard graph-partitioning based 1D decomposition of sparse matrices does not reflect the actual communication volume requirement for parallel SpMxV. We propose two computational hypergraph models which avoid this crucial deficiency of the graph model on 1D decomposition. The proposed models reduce the 1D decomposition problem to the well-known hypergraph partitioning problem. In the literature, there is a lack of 2D decomposition heuristic which directly minimizes the communication requirements for parallel SpMxV computations. Three novel hypergraph models are introduced for 2D decomposition of sparse matrices for minimizing the communication volume requirement. The first hypergraph model is proposed for fine-grain 2D decomposition of the sparse matrices for parallel SpMxV. The second hypergraph model for 2D decomposition is proposed to produce jagged-like decomposition of the sparse matrix. The checkerboard decomposition based parallel matrix-vector multiplication algorithms are widely encountered in the literature. However, only the load balancing problem is addressed in those works. Here, we propose a new hypergraph model which aims the minimization of communication volume while maintaining the load balance among the processors for checkerboard decomposition, as the third model for 2D decomposition. The proposed model reduces the decomposition problem to the multi-constraint hypergraph partitioning problem. The notion of multi-constraint partitioning has recently become popular in graph partitioning. We applied the multi-constraint partitioning to the hypergraph partitioning problem for solving checkerboard partitioning. Graph partitioning by vertex separator (GPVS) is widely used for nested dissection based low fill ordering of sparse matrices for direct solution of linear systems. In this work, we also show that the GPVS problem can formulated as hypergraph partitioning. We exploit this finding to develop a novel hypergraph partitioning-based nested dissection ordering. The recently proposed successful multilevel framework is exploited to develop a multilevel hypergraph partitioning tool PaToH for the experimental verification of our proposed hypergraph models. Experimental results on a wide range of realistic sparse test matrices confirm the validity of the proposed hypergraph models. In terms of communication volume, the proposed hypergraph models produce 30% and 59% better decompositions than the graph model in 1D and 2D decompositions of sparse matricies for parallel SpMxV computations, respectively. The proposed hypergraph partitioning-based nested dissection produces 25% to 45% better orderings than the widely used multiple mimimum degree ordering in the ordering of various test matrices arising from different applications.

Keywords: Sparse matrices, parallel matrix-vector multiplication, parallel processing, matrix decomposition, computational graph model, graph partitioning, computational hypergraph model, hypergraph partitioning, ll reducing ordering, nested dissection.