All technical reports files are in compressed (gzip) postscript format.
The file names are of the sort BU-CE-YYXX.ps.gz
BU-CE-0001
TITLE: Statistical Modeling of Turkish for Automatic Topic Segmentation
AUTHORS: Gökhan Tür, Dilek Zeynep Hakkani-Tür, Kemal Oflazer
ABSTRACT: This paper presents the firstever study on statistical segmentation
of Turkish text into topics. Our results indicate that it is
possible to overcome the problems due to the highly agglutinative
nature of Turkish by building a statistical model based on the
morphological analyses of the words instead of the surface forms. We
have achieved 10.90% segmentation error rate on our test set
according to the weighted TDT2 segmentation cost metric. This is
32% better than the word-based baseline model.
BU-CE-0002
TITLE: Statistical Morphological Disambiguation for Agglutinative Languages
AUTHORS: Dilek Zeynep Hakkani-Tür, Kemal Oflazer, Gökhan Tür
ABSTRACT:
In this paper, we present statistical models for
morphological disambiguation in Turkish. Turkish presents an
interesting problem for statistical models since the potential tag
set size, with tags making the relevant morhosyntactic and semantic
distinctions, is very large because of the productive derivational
morphology. We propose to handle this by breaking up the
morhosyntactic tags into inflectional groups, each of which
contains
the inflectional features for each (intermediate) derived form. Our
statistical
models score the probability of each morhosyntactic tag by
considering statistics over the individual inflection groups in a
trigram model. Among the
three models that we have developed and tested, the simplest model
ignoring the local morphotactics within words performs the best.
Our best trigram model performs with 93.95%
accuracy on our test data getting all the morhosyntactic and
semantic features correct. If we are just interested in
syntactically relevant features and ignore a very small set of
semantic features, then the accuracy rises to 95.07%.
BU-CE-0003 (1.5MB)
PDF version(600KB)
TITLE: Bootstrapping Morphological Analyzers by Combining
Human Elicitation and Machine Learning
AUTHORS: Kemal Oflazer (Bilkent University), Sergei Nirenburg (New Mexico State University), Marjorie McShane (New Mexico State University)
ABSTRACT:
This paper presents a semi-automatic technique for developing
broad-coverage
finite-state morphological analyzers for use in natural language
processing applications. It consists of
three components -- elicitation of linguistic information from
humans, a
machine learning bootstrapping scheme and a testing environment.
The three components are applied iteratively until a threshold of
output quality is attained. The initial application of this technique
is for morphology of low-density languages in the context of the
Expedition project at NMSU Computing Research Laboratory. This
elicit-build-test technique
compiles lexical and inflectional information elicited from a human
into a finite state transducer lexicon and combines this with a
sequence of morphographemic rewrite rules that is induced using
transformation-based learning from the elicited examples. The
resulting morphological analyzer is then tested against a test suite,
and any corrections are fed back into the learning procedure that
builds an improved analyzer.
BU-CE-0004
TITLE: Optimization and cost analysis of multi-stage and
multi-channel filtering configurations
AUTHORS: M. Alper Kutay (Drexel University), Hakan Ozaktas
(Eastern Mediterranean University), M. Fatih Erden (Massana Ltd.), and
Haldun M. Ozaktas (Bilkent University).
ABSTRACT:
The fractional Fourier domain multi-channel and multi-stage filtering
configurations that have been recently proposed
enable us to obtain either exact realizations or useful approximations
of linear systems or matrix-vector products in many different applications.
We discuss the solution and cost analysis for these configurations.
It is shown that the problem can be reduced to a least squares problem which
can be solved with fast iterative techniques.
BU-CE-0005
TITLE: Time-Order Signal Representations
AUTHORS: Haldun M. Ozaktas (Bilkent University) and
M. Alper Kutay (Drexel University).
ABSTRACT:
Just like time-frequency and time-scale representations,
time-order signal representations constitute
an alternative way of displaying the content of a signal, with
the potential to reveal features which may not be evident upon
examination of its other representations and to lead to novel
processing techniques.
We derive many of their properties and their relations to other
time-frequency representations.
Their importance stems from the fact that the Radon
transforms (integral projections) and slices of the Wigner
distribution and the ambiguity function can be expressed in terms of
products or convolutions of various scaled forms of the time-order
representation and its two-dimensional Fourier transform.
BU-CE-0006
Wigner distributions, linear canonical transforms, and
phase-space optics
TITLE:
AUTHORS: Haldun M. Ozaktas (Bilkent University) and
M. Alper Kutay (Drexel University).
ABSTRACT:
This report is an extended self-contained tutorial on
Wigner distributions, linear canonical transforms, and phase-space optics,
supplemented with relevant preliminary material.
BU-CE-0007
TITLE: The Discrete Fractional Fourier Transform and Harper's Equation
AUTHORS: Laurence Barker (Bilkent University)
ABSTRACT:
We show that the discrete fractional Fourier transform
recovers the continuum fractional Fourier transform via a limiting process
whereby inner products are preserved.
BU-CE-0009
TITLE: Instance-Based Regression by Partitioning Feature Projections
AUTHOR: Ilhan Uysal
ABSTRACT:
A new instance-based learning method is presented for regression
problems with high-dimensional data. As an instance-based approach,
the conventional K-Nearest Neighbor (KNN) method has been applied to
both classification and regression problems. Although KNN performs
well for classification tasks, it does not perform similarly for
regression problems. We have developed a new instance-based method,
called Regression by Partitioning Feature Projections (RPFP), to fill
the gap in the literature for a lazy method that achieves a higher
accuracy for regression problems. We also present some additional
properties and even better performance when compared to famous eager
approaches of machine learning and statistics literature such as MARS,
rule-based regression, and regression tree induction systems. The most
important property of RPFP is that it performs much better than all
other eager or lazy approaches on many domains that have missing
values. If we consider databases today, where there are generally
large number of attributes, such sparse domains are very
frequent. RPFP handles such missing values in a very natural way,
since it does not require all the attribute values to be present in
the data set.
Keywords: Machine learning, instance-based learning, regression.
BU-CE-0010
TITLE: Regression by Selecting Appropriate Feature(s)
AUTHOR: Tolga Aydin and H. Altay Guvenir
ABSTRACT:
This paper describes two machine learning methods, called
Regression by Selecting Best Feature Projection (RSBFP) and
Regression by Selecting Best Features (RSBF). RSBFP projects
the training data on each feature dimension and produces exactly one
simple linear regression line on each continuous feature. In the case
of categorical features, exactly one simple linear regression line per
each distinct value of each categorical feature is produced. Then
these simple linear regression lines are analyzed to determine the
feature projection that is best among all projections. The best
feature projection and its corresponding simple linear regression line
is then selected to make predictions. RSBF consists of two phases: The
first phase uses RSBFP to sort predictive power of each feature
projection. The second phase calls multiple linear least squares
regression for number of features times, each time excluding the worst
feature among the current set, and produces multiple linear regression
lines. Finally, these regression lines are analyzed to select the best
subset of features. The best subset of features and its corresponding
multiple linear regression line is then selected to make predictions.
Keywords: Prediction, feature projection, regression.
BU-CE-0011
TITLE: Instance-based Regression by Partitioning Feature Projections
AUTHOR: Ilhan Uysal and H. Altay Guvenir
ABSTRACT:
This paper presents a new instance-based method, called
Regression by Partitioning Feature Projections (RPFP) to fill the
gap in the literature for lazy methods that achieves higher
accuracies for regression problems. RPFP also presents some
additional advantages and even better performance when compared to
other lazy approaches such as k-Nearest Neighbor Regression and
Locally Weighted Regression. RPFP makes predictions on each feature
dimension separately, and combines these predictions to find a
prediction for a given query instance. Even with its additive nature,
RPFP can properly handle strong dependencies between features as
shown by its comparison with the additive algorithm, Regression
on Feature Projections (RFP). Besides those benefits, RPFP enjoys
some other properties; e.g., it is robust to missing values,
irrelevant features and the curse of dimensionality.
Keywords: Regression, function approximation, feature projections.
BU-CE-0012
TITLE: An Eager Regression Method Based on Selecting Appropriate Features
AUTHOR: Tolga Aydin and H. Altay Guvenir
ABSTRACT:
This paper describes a machine learning method, called Regression by
Selecting Best Features (RSBF). RSBF consists of two phases: The first
phase aims to find the predictive power of each feature attribute by
constructing simple linear regression lines, one per each continuous
feature and number of categories per each categorical
feature. Although the predictive power of a continuous feature is
constant, it varies for each distinct value of categorical
features. The second phase constructs multiple linear regression lines
among continuous features, each time excluding the worst feature among
the current set, and constructs multiple linear regression
lines. Finally, these multiple linear regression lines and categorical
features' simple linear regression lines are sorted according to their
predictive power. In the querying phase of learning, the best linear
regression line and the features constructing that line are selected
to make predictions.
Keywords: Prediction, Feature Projection, Regression.
BU-CE-0013
TITLE: Stochastic Comparison on Nearly Completely Decomposable Markov Chains
AUTHOR: Denizhan N. Alparslan
ABSTRACT:
This thesis presents an improved version of a componentwise bounding
algorithm for the steady state probability vector of nearly completely
decomposable Markov chains. The given two-level algorithm uses
aggregation and stochastic comparison with the strong stochastic (st)
order. In order to improve accuracy, it employs reordering of states
and a better componentwise probability bounding algorithm given
st upper- and lower-bounding probability vectors. A thorough analysis
of the algorithm from the point of view of irreducibility is provided.
The bounding algorithm is implemented in sparse storage and its
implementation details are given. Numerical results on an application
of wireless Asynchronous Transfer Mode network show that there are
cases in which the given algorithm proves to be useful in computing
bounds on the performance measures of the system. An improvement in
the algorithm that must be considered to obtain better bounds on
performance measures is also presented at the end.
Keywords: Markov chains, near complete decomposability,
stochastic comparison, st-order, reorderings, aggregation
BU-CE-0014
TITLE: Regression by Selecting Best Feature(s)
AUTHOR: Tolga Aydin
ABSTRACT:
Two new machine learning methods, Regression by Selecting Best Feature
Projections (RSBFP) and Regression by Selecting Best Features (RSBF),
are presented for regression problems. These methods heavily make use
of least squares regression to induce eager, parametric and
context-sensitive models. Famous regression approaches of machine
learning and statistics literature such as DART, MARS, RULE and kNN
can not construct models that are both predictive and having
reasonable training and/or querying time durations. We developed RSBFP
and RSBF to fill the gap in the literature for a regression method
having higher predictive accuracy and faster training and querying
time durations. RSBFP constructs a decision list consisting of simple
linear regression lines belonging to linear features and/or
categorical feature segments. RSBF is the extended version of RSBFP
such that the decision list consists of both simple, belonging to
categorical feature segments, and/or multiple, belonging to linear
features, linear regression lines. A relevancy heuristic has been
developed to determine the features involving in the multiple
regression lines. It is shown that the proposed methods are robust to
irrelevant features, missing feature values and target feature noise,
which make them suitable prediction tools for real-world databases. In
terms of robustness, RSBFP and RSBF give better results when compared
to other famous regression methods.
Keywords: Regression, function approximation, feature projections
BU-CE-0015
TITLE: An Efficient Broadcast Scheduling Algorithm for Pull-Based
Mobile Environments
AUTHOR: K. Murat Karakaya
ABSTRACT:
Thanks to the advances in telecommunications and computers, today
mobile computing becomes a significant means in every pace of
life. Many people are now carrying portable devices such as laptop
computers, Personal Digital Assistants (PDAs), and cellular
phones. These mobile computing devices are supported by rapidly
expanding telecommunication technology. Cellular communication,
wireless LAN and WAN, and satellite services are available for daily
life applications, and portable devices make use of these wireless
connections to contact with the information providers. Thus, a user
does not need to maintain a fixed connection in the network and may
enjoy almost unrestricted user mobility.
As the new and various mobile infrastructures emerge, users demand
a new class of applications running in this environment. However, the
narrow bandwidth of the wireless communication channels, the
relatively short active life of the power supplies of mobile units,
and the mobility of clients make the problem of data retrieval more
difficult than that in wired networks. Therefore, mechanisms to
efficiently transmit information to vast numbers of mobile users are
of significant interest. Data broadcasting has been considered one of
the most promising ways of data dissemination in mobile
environments. There are two basic data broadcasting approaches
available: push and pull. In push-based broadcasting approach, data is
broadcast to mobile users according to users' profiles or
subscriptions, whereas in pull-based broadcasting approach,
transmission of data is initiated by the explicit request of users.
In this thesis, we have focused on the problem of scheduling data
items to broadcast in a pull-based environment. We have developed an
efficient broadcast scheduling algorithm, and comparing its
performance against several well-known algorithms, we have observed
that our algorithm appears to be one of the best algorithms proposed
so far.
Keywords: Mobile computing, mobile database, data broadcast,
broadcast delivery, pull-based, scheduling algorithm
BU-CE-0016
TITLE:
Stochastic Automata Networks and Near Complete Decomposability
AUTHORS: Oleg Gusak, Tugrul Dayar, and Jean-Michel Fourneau
ABSTRACT:
Stochastic automata networks (SANs) have been developed and
used in the last fifteen years as a modeling formalism for large systems
that can be decomposed into loosely connected components. In this work,
we extend the near complete decomposability concept of Markov chains
(MCs) to SANs so that the inherent difficulty associated with solving
the underlying MC can be forecasted and solution techniques based on
this concept can be investigated. A straightforward approach to finding
a nearly completely decomposable (NCD) partitioning of the MC underlying
a SAN requires the computation of the nonzero elements of its global
generator. This is not feasible for very large systems even in sparse
matrix representation due to memory and execution time constraints.
We devise an efficient solution algorithm to this problem that is based
on analyzing the NCD structure of each component of a given SAN.
Numerical results show that the given algorithm performs much better
than the straightforward approach.
Keywords:
Markov chains, stochastic automata networks, near complete
decomposability, state classification.
BU-CE-0017
TITLE: Hypergraph Based Declustering for Multi-Disk Databases
AUTHORS: Mehmet Koyuturk
ABSTRACT:
In very large distributed database systems, the data is declustered in
order to exploit parallelism while processing a query. Declustering
refers to allocating the data into multiple disks in such a way that
the tuples belonging to a relation are distributed evenly across
disks. There are many declustering strategies proposed in the
literature, however these strategies are domain specific or have
deficiencies. We propose a model that exactly fits the problem and
show that iterative improvement schemes can capture detailed
per-relation basis declustering objective. We provide a two phase
iterative improvement based algorithm and appropriate gain functions for
these algorithms. The experimental results show that the proposed
algorithm provides a significant performance improvement compared
to the state-of-the-art graph-partitioning based declustering strategy.
Keywords:
Distributed Databases, Declustering, Hypergraph Partitioning,
Max-cut Graph Partitioning