Technical Reports 2000

Technical Reports Published in 2000

BU-CE-0001: PDF

Statistical Modeling of Turkish for Automatic Topic Segmentation
Gökhan Tür, Dilek Zeynep Hakkani-Tür, Kemal Oflazer
This paper presents the firstever study on statistical segmentation of Turkish text into topics. Our results indicate that it is possible to overcome the problems due to the highly agglutinative nature of Turkish by building a statistical model based on the morphological analyses of the words instead of the surface forms. We have achieved 10.90% segmentation error rate on our test set according to the weighted TDT2 segmentation cost metric. This is 32% better than the word-based baseline model.
BU-CE-0002: PDF

Statistical Morphological Disambiguation for Agglutinative Languages
Dilek Zeynep Hakkani-Tür, Kemal Oflazer, Gökhan Tür
In this paper, we present statistical models for morphological disambiguation in Turkish. Turkish presents an interesting problem for statistical models since the potential tag set size, with tags making the relevant morhosyntactic and semantic distinctions, is very large because of the productive derivational morphology. We propose to handle this by breaking up the morhosyntactic tags into inflectional groups, each of which contains the inflectional features for each (intermediate) derived form. Our statistical models score the probability of each morhosyntactic tag by considering statistics over the individual inflection groups in a trigram model. Among the three models that we have developed and tested, the simplest model ignoring the local morphotactics within words performs the best. Our best trigram model performs with 93.95% accuracy on our test data getting all the morhosyntactic and semantic features correct. If we are just interested in syntactically relevant features and ignore a very small set of semantic features, then the accuracy rises to 95.07%.
BU-CE-0003: PDF

Bootstrapping Morphological Analyzers by Combining Human Elicitation and Machine Learning
Kemal Oflazer (Bilkent University), Sergei Nirenburg (New Mexico State University), Marjorie McShane (New Mexico State University)
This paper presents a semi-automatic technique for developing broad-coverage finite-state morphological analyzers for use in natural language processing applications. It consists of three components -- elicitation of linguistic information from humans, a machine learning bootstrapping scheme and a testing environment. The three components are applied iteratively until a threshold of output quality is attained. The initial application of this technique is for morphology of low-density languages in the context of the Expedition project at NMSU Computing Research Laboratory. This elicit-build-test technique compiles lexical and inflectional information elicited from a human into a finite state transducer lexicon and combines this with a sequence of morphographemic rewrite rules that is induced using transformation-based learning from the elicited examples. The resulting morphological analyzer is then tested against a test suite, and any corrections are fed back into the learning procedure that builds an improved analyzer.
BU-CE-0004: PDF

Optimization and Cost Analysis of Multi-stage and Multi-channel Filtering Configurations
M. Alper Kutay (Drexel University), Hakan Özaktaş (Eastern Mediterranean University), M. Fatih Erden (Massana Ltd.), and Haldun M. Özaktaş (Bilkent University)
The fractional Fourier domain multi-channel and multi-stage filtering configurations that have been recently proposed enable us to obtain either exact realizations or useful approximations of linear systems or matrix-vector products in many different applications. We discuss the solution and cost analysis for these configurations. It is shown that the problem can be reduced to a least squares problem which can be solved with fast iterative techniques.
BU-CE-0005: PDF

Time-Order Signal Representations
Haldun M. Özaktaş (Bilkent University) and M. Alper Kutay (Drexel University)
Just like time-frequency and time-scale representations, time-order signal representations constitute an alternative way of displaying the content of a signal, with the potential to reveal features which may not be evident upon examination of its other representations and to lead to novel processing techniques. We derive many of their properties and their relations to other time-frequency representations. Their importance stems from the fact that the Radon transforms (integral projections) and slices of the Wigner distribution and the ambiguity function can be expressed in terms of products or convolutions of various scaled forms of the time-order representation and its two-dimensional Fourier transform.
BU-CE-0006: PDF

Wigner Distributions, Linear Canonical Transforms, and Phase-space Optics
Haldun M. Özaktaş (Bilkent University) and M. Alper Kutay (Drexel University)
This report is an extended self-contained tutorial on Wigner distributions, linear canonical transforms, and phase-space optics, supplemented with relevant preliminary material.
BU-CE-0007: PDF

The Discrete Fractional Fourier Transform and Harper's Equation
Laurence Barker (Bilkent University)
We show that the discrete fractional Fourier transform recovers the continuum fractional Fourier transform via a limiting process whereby inner products are preserved.
BU-CE-0009: PDF

Instance-Based Regression by Partitioning Feature Projections (M. Sc. Thesis)
İlhan Uysal
A new instance-based learning method is presented for regression problems with high-dimensional data. As an instance-based approach, the conventional K-Nearest Neighbor (KNN) method has been applied to both classification and regression problems. Although KNN performs well for classification tasks, it does not perform similarly for regression problems. We have developed a new instance-based method, called Regression by Partitioning Feature Projections (RPFP), to fill the gap in the literature for a lazy method that achieves a higher accuracy for regression problems. We also present some additional properties and even better performance when compared to famous eager approaches of machine learning and statistics literature such as MARS, rule-based regression, and regression tree induction systems. The most important property of RPFP is that it performs much better than all other eager or lazy approaches on many domains that have missing values. If we consider databases today, where there are generally large number of attributes, such sparse domains are very frequent. RPFP handles such missing values in a very natural way, since it does not require all the attribute values to be present in the data set.
Keywords: Machine learning, instance-based learning, regression.
BU-CE-0010: PDF

Regression by Selecting Appropriate Feature(s)
Tolga Aydın and H. Altay Güvenir
This paper describes two machine learning methods, called Regression by Selecting Best Feature Projection (RSBFP) and Regression by Selecting Best Features (RSBF). RSBFP projects the training data on each feature dimension and produces exactly one simple linear regression line on each continuous feature. In the case of categorical features, exactly one simple linear regression line per each distinct value of each categorical feature is produced. Then these simple linear regression lines are analyzed to determine the feature projection that is best among all projections. The best feature projection and its corresponding simple linear regression line is then selected to make predictions. RSBF consists of two phases: The first phase uses RSBFP to sort predictive power of each feature projection. The second phase calls multiple linear least squares regression for number of features times, each time excluding the worst feature among the current set, and produces multiple linear regression lines. Finally, these regression lines are analyzed to select the best subset of features. The best subset of features and its corresponding multiple linear regression line is then selected to make predictions.
Keywords: Prediction, feature projection, regression.
BU-CE-0011: PDF

Instance-based Regression by Partitioning Feature Projections
İlhan Uysal and H. Altay Güvenir
This paper presents a new instance-based method, called Regression by Partitioning Feature Projections (RPFP) to fill the gap in the literature for lazy methods that achieves higher accuracies for regression problems. RPFP also presents some additional advantages and even better performance when compared to other lazy approaches such as k-Nearest Neighbor Regression and Locally Weighted Regression. RPFP makes predictions on each feature dimension separately, and combines these predictions to find a prediction for a given query instance. Even with its additive nature, RPFP can properly handle strong dependencies between features as shown by its comparison with the additive algorithm, Regression on Feature Projections (RFP). Besides those benefits, RPFP enjoys some other properties; e.g., it is robust to missing values, irrelevant features and the curse of dimensionality.
Keywords: Regression, function approximation, feature projections.
BU-CE-0012: PDF

An Eager Regression Method Based on Selecting Appropriate Features
Tolga Aydın and H. Altay Güvenir
This paper describes a machine learning method, called Regression by Selecting Best Features (RSBF). RSBF consists of two phases: The first phase aims to find the predictive power of each feature attribute by constructing simple linear regression lines, one per each continuous feature and number of categories per each categorical feature. Although the predictive power of a continuous feature is constant, it varies for each distinct value of categorical features. The second phase constructs multiple linear regression lines among continuous features, each time excluding the worst feature among the current set, and constructs multiple linear regression lines. Finally, these multiple linear regression lines and categorical features' simple linear regression lines are sorted according to their predictive power. In the querying phase of learning, the best linear regression line and the features constructing that line are selected to make predictions.
Keywords: Prediction, Feature Projection, Regression.
BU-CE-0013: PDF

Stochastic Comparison on Nearly Completely Decomposable Markov Chains (M. Sc. Thesis)
Denizhan N. Alparslan
This thesis presents an improved version of a componentwise bounding algorithm for the steady state probability vector of nearly completely decomposable Markov chains. The given two-level algorithm uses aggregation and stochastic comparison with the strong stochastic (st) order. In order to improve accuracy, it employs reordering of states and a better componentwise probability bounding algorithm given st upper- and lower-bounding probability vectors. A thorough analysis of the algorithm from the point of view of irreducibility is provided. The bounding algorithm is implemented in sparse storage and its implementation details are given. Numerical results on an application of wireless Asynchronous Transfer Mode network show that there are cases in which the given algorithm proves to be useful in computing bounds on the performance measures of the system. An improvement in the algorithm that must be considered to obtain better bounds on performance measures is also presented at the end.
Keywords: Markov chains, near complete decomposability, stochastic comparison, st-order, reorderings, aggregation
BU-CE-0014: PDF

Regression by Selecting Best Feature(s) (M. Sc. Thesis)
Tolga Aydın
Two new machine learning methods, Regression by Selecting Best Feature Projections (RSBFP) and Regression by Selecting Best Features (RSBF), are presented for regression problems. These methods heavily make use of least squares regression to induce eager, parametric and context-sensitive models. Famous regression approaches of machine learning and statistics literature such as DART, MARS, RULE and kNN can not construct models that are both predictive and having reasonable training and/or querying time durations. We developed RSBFP and RSBF to fill the gap in the literature for a regression method having higher predictive accuracy and faster training and querying time durations. RSBFP constructs a decision list consisting of simple linear regression lines belonging to linear features and/or categorical feature segments. RSBF is the extended version of RSBFP such that the decision list consists of both simple, belonging to categorical feature segments, and/or multiple, belonging to linear features, linear regression lines. A relevancy heuristic has been developed to determine the features involving in the multiple regression lines. It is shown that the proposed methods are robust to irrelevant features, missing feature values and target feature noise, which make them suitable prediction tools for real-world databases. In terms of robustness, RSBFP and RSBF give better results when compared to other famous regression methods.
Keywords: Regression, function approximation, feature projections
BU-CE-0015: PDF

An Efficient Broadcast Scheduling Algorithm for Pull-Based Mobile Environments (M. Sc. Thesis)
K. Murat Karakaya
Thanks to the advances in telecommunications and computers, today mobile computing becomes a significant means in every pace of life. Many people are now carrying portable devices such as laptop computers, Personal Digital Assistants (PDAs), and cellular phones. These mobile computing devices are supported by rapidly expanding telecommunication technology. Cellular communication, wireless LAN and WAN, and satellite services are available for daily life applications, and portable devices make use of these wireless connections to contact with the information providers. Thus, a user does not need to maintain a fixed connection in the network and may enjoy almost unrestricted user mobility.
As the new and various mobile infrastructures emerge, users demand a new class of applications running in this environment. However, the narrow bandwidth of the wireless communication channels, the relatively short active life of the power supplies of mobile units, and the mobility of clients make the problem of data retrieval more difficult than that in wired networks. Therefore, mechanisms to efficiently transmit information to vast numbers of mobile users are of significant interest. Data broadcasting has been considered one of the most promising ways of data dissemination in mobile environments. There are two basic data broadcasting approaches available: push and pull. In push-based broadcasting approach, data is broadcast to mobile users according to users' profiles or subscriptions, whereas in pull-based broadcasting approach, transmission of data is initiated by the explicit request of users.
In this thesis, we have focused on the problem of scheduling data items to broadcast in a pull-based environment. We have developed an efficient broadcast scheduling algorithm, and comparing its performance against several well-known algorithms, we have observed that our algorithm appears to be one of the best algorithms proposed so far.
Keywords: Mobile computing, mobile database, data broadcast, broadcast delivery, pull-based, scheduling algorithm
BU-CE-0016: PDF

Stochastic Automata Networks and Near Complete Decomposability
Oleg Gusak, Tuğrul Dayar, and Jean-Michel Fourneau
Stochastic automata networks (SANs) have been developed and used in the last fifteen years as a modeling formalism for large systems that can be decomposed into loosely connected components. In this work, we extend the near complete decomposability concept of Markov chains (MCs) to SANs so that the inherent difficulty associated with solving the underlying MC can be forecasted and solution techniques based on this concept can be investigated. A straightforward approach to finding a nearly completely decomposable (NCD) partitioning of the MC underlying a SAN requires the computation of the nonzero elements of its global generator. This is not feasible for very large systems even in sparse matrix representation due to memory and execution time constraints. We devise an efficient solution algorithm to this problem that is based on analyzing the NCD structure of each component of a given SAN. Numerical results show that the given algorithm performs much better than the straightforward approach.
Keywords: Markov chains, stochastic automata networks, near complete decomposability, state classification.
BU-CE-0017: PDF

Hypergraph Based Declustering for Multi-Disk Databases
Mehmet Koyutürk
In very large distributed database systems, the data is declustered in order to exploit parallelism while processing a query. Declustering refers to allocating the data into multiple disks in such a way that the tuples belonging to a relation are distributed evenly across disks. There are many declustering strategies proposed in the literature, however these strategies are domain specific or have deficiencies. We propose a model that exactly fits the problem and show that iterative improvement schemes can capture detailed per-relation basis declustering objective. We provide a two phase iterative improvement based algorithm and appropriate gain functions for these algorithms. The experimental results show that the proposed algorithm provides a significant performance improvement compared to the state-of-the-art graph-partitioning based declustering strategy.
Keywords: Distributed Databases, Declustering, Hypergraph Partitioning, Max-cut Graph Partitioning