All technical reports files are in compressed (gzip) postscript format.
The file names are of the sort BU-CEIS-YYXX.ps.z
BU-CEIS-9601
TITLE: k Nearest Neighbor Classification on Feature Projections
AUTHORS: Aynur Akkus and H. Altay Güvenir
ABSTRACT: This paper proposes a new approach to classification based
on a majority voting on individual classifications made by the
projections of the training set on each feature. We have applied the
k-nearest neighbor algorithm to determine the classifications made
on individual feature projections. We called the resulting algorithm
k-NNFP, for k-Nearest Neighbor on Feature Projections. The most
important characteristic of this technique is that the training
instances are stored as their projections on each feature
dimension. This allows the classification of a new instance to be
made much faster than k-NN algorithm. The voting mechanism reduces
the negative effect of possible irrelevant features in
classification. Also the classification accuracy of k-NNFP
increases when the k value is increased, which indicates that the
process of classification can incorporate the learned classification
knowledge better when k increases. The k-NNFP algorithm is compared
with the k-NN algorithm, in terms of classification accuracy and
running time on some real-world and artificial datasets.
BU-CEIS-9602
TITLE: Tactical Generation in a Free Constituent Order Language
AUTHORS: Dilek Zeynep Hakkani, Kemal Oflazer, and Ilyas Cicekli
ABSTRACT: This paper describes tactical generation in Turkish, a free
constituent order language, in which the order of the constituents may
change according to the information structure of the sentences to be
generated. In the absence of any information regarding the information
structure of a sentence (i.e., topic, focus, background, etc.), the
constituents of the sentence obey a default order, but the order is
almost freely changeable, otherwise. We have used a recursively
structured finite state machine (essentially an ATN minus the
registers) for handling the changes in constituent order, implemented
as a right-linear grammar in a unification-based generation grammar
environment, GenKit, developed at Carnegie Mellon University--Center
for Machine Translation. Morphological realization has been
implemented using an external morphological analysis/generation
component which performs morpheme selection and handles
morphographemic processes.
BU-CEIS-9603
TITLE: Unsupervised Learning in Constraint-based Morphological
Disambiguation
AUTHORS: Kemal Oflazer and Gokhan Tur
ABSTRACT: This paper presents a constraint-based morphological disambiguation
approach that uses unsupervised learning component to discover some
of the constraints it uses. It is specifically applicable to
languages with productive inflectional and derivational
morphological processes, such as Turkish, where morphological
ambiguity has a rather different nature than that found in languages
like English. Our approach starts with a set of corpus-independent
hand-crafted rules that reduce morphological ambiguity (hence
improve precision) without sacrificing recall. It then uses an
untagged training corpus in which all lexical items have been
annotated with all possible morphological analyses, incrementally
proposing and evaluating additional (possibly corpus dependent)
constraints for disambiguation of morphological parses using the
constraints imposed by unambiguous contexts. These rules choose
parses with specified features. It then learns in an unsupervised
manner, additional rules for removing parses with certain
features. In certain respects, our approach has been motivated by
Brill's recent work, but with the
observation that his transformational approach is not directly
applicable to languages like Turkish. Our results indicate that
using hand-crafted rules and rules learned to choose, we can attain
a recall of 99.08% and a precision of 88.08% with 1.119 parses per
token, on the training text. When rules learned to delete are used
in addition to these, we can attain a recall of 96.76% and a
precision of 92.05% and 1.051 parses per token on the training
text. On previously unseen text, we can attain a recall of 98.04%
and a precision of 86.23% with 1.137 parses per token using just
the hand-crafted rules and rules learned to choose. When rules
learned to delete are used we can attain a recall of 96.99% and a
precision of 88.13% and 1.100 parses per token.
BU-CEIS-9604
TITLE: Partial Query Evaluation for Vertically Partitioned Signature
Files in very Large Unformatted Databases (PhD Thesis -- 131 Pages)
AUTHOR: Seyit Kocberber
ABSTRACT:Signature file approach is a well-known indexing technique. The
Bit Sliced Signature File (BSSF) method provides an efficient retrieval
environment by only accessing on-bits of query signatures. However,
its response time increases for increasing number of query terms. For
BSSF we define a stopping condition that tries to avoid the processing
of all on-bits of query signatures through partial evaluation. The aim
of the stopping condition is to reduce the expected number of false
drops to the level that also provides the lowest response time within
the framework of BSSF. We propose the Partially evaluated Bit- Sliced
Signature File (P-BSSF) method that employs the partial evaluation
strategy and minimizes the response time in a multi-term query
environment by considering the submission probabilities of the queries
with different number of terms. Experiments show that P-BSSF provides
85% improvement in response time over BSSF depending on space overhead
and the number of query terms. To provide better optimization of the
signature file parameters in multi-term query environments, we propose
Multi-Fragmented Signature File (MFSF) method as an extension of
P-BSSF. In MFSF, a signature file is divided into variable sized
vertical fragments with different on-bit densities to optimize the
response time using a similar query evaluation methodology. In query
evaluation the query signature on-bits of the lower on-bit density
fragments are used first. As the number of query terms increases, the
number of query signature on-bits in the lower on-bit density
fragments increases and the query stopping condition is reached in
fewer bit slice evaluation steps. Therefore, in MFSF, the response
time decreases for an increasing number of query terms. The analysis
shows that, with no space overhead, MFSF outperforms the P-BSSF and
generalized frame-sliced signature file organizations. Due to hashing
and superimposition operations used in obtaining signatures, the
signature of an irrelevant record may match the query signature, i.e.,
it is possible to have false drops. In signature file processing the
accurate estimation of the number of false drops is essential to
obtain system parameters that will yield a desirable response time.
We propose a more accurate false drop estimation method for databases
with varying record lengths instead of using an average number of
distinct terms per record. In this method, the records of a database
are conceptually partitioned according to the number of distinct terms
they contain and the number of false drops of each group is estimated
separately. Experiments with real data show that depending on the
space overhead, the proposed method obtains up to 33%, 25%, and 20%
response time improvements for the sequential, generalized
frame-sliced, and MFSF methods, respectively. For very large
databases even one bit slice of MFSF may occupy several disk blocks.
We propose the Compressed Multi-Fragmented Signature File (C-MFSF)
method that stores the bit slices of MFSF in a compact form which
provides a better response time. To minimize the number of disk
accesses, the signature size and the disk block size can be adjusted
such that most bit slices fit into a single disk block after
compression. In such environments, C-MFSF evaluates the queries with
more than two terms with only one disk access per query term rather
than two disk accesses of the inverted file method which are
respectively for the pointer of the query term posting list and the
list itself. Our projection based on real data shows that for a
database of one million records C-MFSF provides a response time of
0.85 seconds.
BU-CEIS-9605
TITLE: Inductive Logic Program Synthesis with DIALOGS
AUTHOR: Pierre Flener
ABSTRACT: DIALOGS (Dialog-based Inductive and Abductive LOGic program
Synthesizer) is a schema-guided synthesizer of recursive logic
programs; it takes the initiative and minimally queries a (possibly
computationally naive) specifier for evidence in her/his conceptual
language. The specifier must know the answers to such simple queries,
because otherwise s/he wouldn't even feel the need for the synthesized
program. DIALOGS can be used by any learner (including itself) that
detects, or merely conjectures, the necessity of invention of a new
predicate. Due to its foundation on a powerful codification of a
"recursion-theory" (by means of the template and constraints of a
divide-and-conquer schema), DIALOGS needs very little evidence and is
very fast.
BU-CEIS-9606
TITLE: Learning Translation Templates from Bilingual Texts
AUTHORS: H. Altay Güvenir and Ilyas Cicekli
ABSTRACT: This paper proposes a mechanism for learning
structural correspondences between two languages from a corpus of
translated sentence pairs. The proposed mechanism uses analogical
reasoning between two translations. Given a pair of translations, the
similar parts of the sentences in the source language must correspond
the similar parts of the sentences in the target language. Similarly,
the different parts should correspond to the respective parts in the
translated sentences. The correspondences between the similarities,
and also differences are learned in the form of translation templates.
The system is tested on a small training dataset and produced
promising results for further investigation.
BU-CEIS-9606
TITLE: Using planes of optoelectronic devices interconnected by
free-space optics: systems issues and application opportunities
AUTHORS: Haldun M. Ozaktas
Please note duble numbering with the report above.
BU-CEIS-9607
TITLE: Corpus-Based Learning of Generalized Parse Tree Rules for
Translation
AUTHORS: H. Altay Güvenir and Aysegul Tunc
ABSTRACT: This paper proposes a learning mechanism to acquire structural
correspondences between two languages from a corpus of translated
sentence pairs. The proposed mechanism uses analogical reasoning
between two translations. Given a pair of translations, the similar
parts of the sentences in the source language must correspond the
similar parts of the sentences in the target language. Similarly, the
different parts should correspond to the respective parts in the
translated sentences. The correspondences between the similarities,
and also differences are learned in the form of rewrite rules.
The system is tested on a small training dataset and produced
promising results for further investigation.
BU-CEIS-9608
TITLE: Decomposing Irregularly Sparse Matrices for Parallel
Matrix-Vector Multiplication
AUTHORS: Umit V. Catalyurek and Cevdet Aykanat
ABSTRACT: In this work, we show the deficiencies of the graph model
for decomposing sparse matrices for parallel matrix-vector
multiplication. Then, we propose two hypergraph models which avoid
all deficiencies of the graph model. The proposed models reduce the
decomposition problem to the well-known hypergraph partitioning
problem widely encountered in circuit partitioning in VLSI. We have
implemented fast Kernighan-Lin based graph and hypergraph partitioning
heuristics and used the successful multilevel graph partitioning tool
(Metis) for the experimental evaluation of the validity of the
proposed hypergraph models. We have also developed a multilevel
hypergraph partitioning heuristic for experimenting the performance of
the multilevel approach on hypergraph partitioning. Experimental
results on sparse matrices, selected from Harwell-Boeing collection
and NETLIB suite, confirm both the validity of our proposed hypergraph
models and appropriateness of the multilevel approach to hypergraph
partitioning.
BU-CEIS-9609
TITLE: Two Novel Multiway Circuit Partitioning Algorithms Using
Relaxed Locking
AUTHORS: Ali Dasdan and Cevdet Aykanat
ABSTRACT: All the previous Kernighan-Lin based circuit partitioning algorithms
employ the locking mechanism, which enforces each cell to be moved exactly
once per pass. In this paper, we propose for multiway circuit partitioning
two novel approaches to overcome this limitation. The proposed approaches
are based on our claim that the performance of a partitioning algorithm gets
better by allowing each cell to be moved more than once per pass. The first
approach introduces the phase concept so that each pass can contain more than
one phase, and each cell has a chance to be moved in all the passes. This
approach uses the locking mechanism in a relaxed manner in that it does not
allow a cell to be moved more than once in a phase. The second approach does
not use the locking mechanism at all. It introduces the mobility concept,
which allows a cell to be moved more than once in a pass. Each approach
leads to a Kernighan-Lin based generic algorithm whose parameters can be set
in such a way that better performance is obtained by spending more time.
By setting these parameters, we generated three versions of each generic
algorithm. The proposed algorithms were experimentally evaluated in comparison
with Sanchis' multiway circuit partitioning algorithm (FMS algorithm) and the
simulated annealing algorithm (SA algorithm) on benchmark
circuits. The proposed algorithms outperform FMS algorithm
significantly especially when the number of parts in the partition was
large and and the circuit was sparse. Their performance is comparable
to that of SA algorithm though the running time of SA algorithm is far
larger than those of the proposed algorithms. We also performed some
experiments on the parameters of the proposed algorithms, and provided
guidelines for good parameter settings. Besides, we gave a new
formulation of multiway circuit partitioning that combined those of
the previous approaches, and presented detailed algorithms and their
time and space complexity analysis. We observed that experimental
results supported our claim. The proposed approaches are effective and
efficient, and are applicable to almost all of the previous
Kernighan-Lin based algorithms.
BU-CEIS-9610
TITLE: OBJECTIVE: A Benchmark for Object-Oriented Active Database
Systems
AUTHORS: Ugur Cetintemel, Jurgen Zimmerman, Ozgur Ulusoy, Alejandro
Buchmann
ABSTRACT:Although much work in the area of Active Database Management Systems
(ADBMSs) has been done, it is not yet clear how the performance of an active
DBMS can be evaluated systematically. In this paper, we describe the OBJECTIVE
Benchmark for object-oriented ADBMSs, and present experimental results from
its implementation in an active database system prototype. OBJECTIVE can be
used to identify performance bottlenecks and active functionalities of an
ADBMS, and compare the performance of multiple ADBMSs.
BU-CEIS-9611
TITLE: OBJECTIVE: A Benchmark for Object-Oriented Active Database
Systems (M.Sc. thesis)
AUTHOR: Ugur Cetintemel
ABSTRACT: Although much work in the area of Active Database Management Systems
(ADBMSs) has been done, there have been only a few attempts to
evaluate the performance of these systems, and it is not yet clear how
the performance of an active DBMS can be evaluated systematically.
In this thesis, we describe the OBJECTIVE Benchmark for
object-oriented ADBMSs, and present experimental results from its
implementation in an active database system prototype. OBJECTIVE can
be used to identify performance bottlenecks and active functionalities
of an ADBMS, and compare the performance of multiple ADBMSs. The
philosophy of OBJECTIVE is to isolate components providing active
functionalities, and concentrate only on the performance of these
components while attempting to minimize the effects of other factors.
BU-CEIS-9612
TITLE: Genetic Algorithms to Learn Feature Weights for the Nearest
Neighbor Algorithm
AUTHORS: Gülsen Demiröz and H. Altay Güvenir
ABSTRACT: In this paper we use genetic algorithms to learn feature weights for
the Nearest Neighbor classification algorithm. We represent
feature weights as real values in [0..1] and their sum is 1.
A new crossover operation, called continuous uniform crossover,
is introduced where the legality of chromosomes is preserved after the
crossover operation. This new crossover operation is compared with
three other crossover operations one-point crossover>,
two-point crossover, and uniform crossover all of which
require normalization after the crossover operation. Four genetic
algorithms using each of these four crossover operations are
implemented. Each genetic algorithm tries to learn feature weights to
improve the classification accuracy of the Nearest Neighbor
algorithm. Then the learned weights are assigned to features to be
used in distance calculations of the Weighted Nearest Neighbor
classification algorithm and the resulting classification accuracies
in our datasets are compared.
BU-CEIS-9613
TITLE: Decision Tree Induction Using Genetic Programming
AUTHORS: Gökhan Tur and Halil Altay Güvenir
ABSTRACT: A decision tree induction algorithm using genetic programming (GP) is
presented. The best decision tree is defined as the one which achieves
maximum accuracy with minimum number of internal nodes. In this
approach every individual is a decision tree candidate. The
results are satisfactory in the sense that it can find the optimum
solution, i.e. the best decision tree, for a small sized dataset. For
the dataset related to American Congress, this algorithm can reach an
accuracy of 97.3%. Solutions to unknown or noise data are also
proposed in this framework.
BU-CEIS-9614
TITLE: Design and Implementation of a Tactical Generator for Turkish, a Free Constituent Order Language (M.Sc. Thesis)
AUTHOR: Dilek Zeynep Hakkani
ABSTRACT: This thesis describes a tactical generator for Turkish, a free
constituent order language, in which the order of the constituents may
change according to the information structure of the sentences to be
generated. In the absence of any information regarding the information
structure of a sentence (i.e., topic, focus, background, etc.), the
constituents of the sentence obey a default order, but the order is
almost freely changeable, depending on the constraints of the text
flow or discourse. We have used a recursively structured finite state
machine for handling the changes in constituent order, implemented as
a right-linear grammar backbone. Our implementation environment is the
GenKit system, developed at Carnegie Mellon University--Center for
Machine Translation. Morphological realization has been implemented
using an external morphological analysis/generation component which
performs concrete morpheme selection and handles morphographemic
processes.
BU-CEIS-9615
TITLE: Using Multiple Sources of Information for Constraint-based
Morphological Disambiguation (M.Sc. Thesis)
AUTHOR: Gökhan Tür
ABSTRACT: This thesis presents a constraint-based morphological disambiguation
approach that is applicable to languages with complex
morphology--specifically agglutinative languages with productive
inflectional and derivational morphological phenomena. For
morphologically complex languages like Turkish, automatic
morphological disambiguation involves selecting for each token
morphological parse(s), with the right set of inflectional and
derivational markers. Our system combines corpus independent
hand-crafted constraint rules, constraint rules that are learned via
unsupervised learning from a training corpus, and additional
statistical information obtained from the corpus to be morphologically
disambiguated. The hand-crafted rules are linguistically motivated
and tuned to improve precision without sacrificing recall. In certain
respects, our approach has been motivated by Brill's recent work,
but with the observation that his transformational approach is not directly
applicable to languages like Turkish. Our approach also uses a novel
approach to unknown word processing by employing a secondary morphological
processor which recovers any relevant inflectional and derivational
information from a lexical item whose root is unknown. With this approach,
well below 1% of the tokens remains as unknown in the texts we have
experimented with. Our results indicate that by combining these hand-crafted,
statistical and learned information sources, we can attain a recall of
96 to 97% with a corresponding precision of 93 to 94%, and ambiguity
of 1.02 to 1.03 parses per token.
BU-CEIS-9616
TITLE: Weighting Features in k Nearest Neighbor Classification on
Feature Projections
AUTHORS: Aynur Akkus and H. Altay Güvenir
ABSTRACT: This paper proposes two methods for learning feature weights
to improve the classification accuracy of the k-NNFP algorithm. In
the k-NNFP algorithm, instances are stored as their projections
on each feature dimension. The classification of unseen examples are
made on the basis of feature projections by a majority voting among
the k (k <= 1) predictions of each feature
separately. We have treated all features as equivalent in this
algorithm. However, all features may not have equal relevancy, even
some features may be completely irrelevant. In order to determine
features' relevances, the best method is to assign them weights. The
first method is based on the assumption of homogeneous feature
projections for which the number of consequent values of feature
projections of a same class supports an evidence for increasing the
probability of correct classification in the k-NNFP
algorithm. The second method is based on the individual accuracies of
features. In this approach, the k-NNFP algorithm is run on the basis
of a single feature, once for each feature. The resulting accuracy is
taken as the weight of that feature since it is a measure of
contribution to classification for that feature. Empirical evaluation
of these feature weighting methods in the k-NNFP algorithm on real
world datasets is given.
BU-CEIS-9617
TITLE: Generation of Turkish Verbal Groups with Systemic-Functional Grammar
AUTHORS: Turgay Korkmaz and Ilyas Cicekli
ABSTRACT:
This paper mainly presents the verbal group part of a Turkish
sentence generator that is currently under development
for producing the actual text from its semantic description.
To concentrate on the text generation rather
than text planning,
we assume that the semantic description of
the text is produced in some way, currently given by hand.
In the generation, we need a linguistic theory to
describe the linguistic resources, and also a software tool to
perform them in a computational environment. We use a
functional linguistic theory called Systemic--Functional Grammar
(SFG) to represent the linguistic resources,
and FUF text generation system as a software tool to perform them.
In this paper,
we present the systemic--functional
representation and realization of Turkish verbal groups.
BU-CEIS-9618
TITLE: Turkish Text Generation With Systemic-Functional Grammar
(M.Sc. Thesis)
AUTHOR: Turgay Korkmaz
ABSTRACT:
Natural Language Generation (NLG) is roughly
decomposed into two stages:
text planning, and text generation.
In the text planning stage, the semantic description of
the text is produced from the conceptual inputs.
Then, the text generation system transforms this semantic
description into an actual text.
This thesis focuses on the design and implementation of a Turkish text
generation system rather than text planning.
To develop a text generator,
we need a linguistic theory that
describes the resources of the desired natural language,
and also a software tool that represents and
performs these linguistic resources in a computational environment.
In this thesis, in order to carry out the mentioned requirements,
we have used a functional linguistic theory called
Systemic--Functional Grammar (SFG),
and the FUF text generation system as a software tool.
The ultimate text generation system takes the semantic description of
the text sentence by sentence, and then produces a morphological
description for each lexical constituent of the sentence.
The morphological descriptions are worded by a Turkish morphological
generator.
Because of our concentration on the text generation, we have not
considered the details of the text planning.
Hence, we assume that the semantic description of the text is produced
and lexicalized by an application (currently given by hand).
BU-CEIS-9619
TITLE: Learning Translation Rules From A Bilingual Corpus
AUTHOR: Ilyas Cicekli and H. Altay Güvenir
ABSTRACT: This paper proposes a mechanism for learning
pattern correspondences between two languages from a corpus of
translated sentence pairs. The proposed mechanism uses analogical
reasoning between two translations. Given a pair of translations, the
similar parts of the sentences in the source language must correspond
the similar parts of the sentences in the target language. Similarly,
the different parts should correspond to the respective parts in the
translated sentences. The correspondences between the similarities,
and also differences are learned in the form of translation rules.
The system is tested on a small training dataset and produced
promising results for further investigation.
BU-CEIS-9620
TITLE:Nonsymmetric Skyline versus Compressed Sparse Row Format
in Direct Methods for Markov Chains
AUTHOR: Tugrul Dayar
ABSTRACT: This paper investigates the implementation of Gaussian
Elimination (GE) and Grassmann-Taksar-Heyman (GTH) algorithms for
Markov chains in nonsymmetric skyline (NSK) and compressed sparse
row (CSR) formats. Numerical experiments are carried out using three
real life applications. Implementations that allocate space for both
lower- and upper-triangular factors result in faster GTH solvers.
Specifically, the CSR implementation of the GTH algorithm for the
nontransposed linear system of equations gives the smallest run-time
among different GTH implementations considered. For GE, the experiments
suggest the CSR implementation for the transposed system of equations
as the appropriate choice.
BU-CEIS-9621
TITLE: Animating Facial Images with Drawings (M.Sc. Thesis)
AUTHOR: Gamze Dilek Tunali
ABSTRACT: he work presented here describes the power of 2D animation
with texture mapping controlled by line drawings. Animation is specifically
intended for facial animation and not restricted by the human face. We
initially have a sequence of facial images which are taken from a video
sequence of the same face and an image of another face to be animated. The
aim is to animate the face image with the same expressions as those of the
given video sequence. To realize the animation, a set of frames are taken
from a video sequence. Key features of the first frame are rotoscoped and
the other frames are automatically rotoscoped using the first frame.
Similarly, the corresponding features of the image which will be
animated are rotoscoped. The key features of the first frame of the
sequence and the image to be animated are mapped and using
cross-synthesis procedure, other drawings for the given image are
produced. Using these animated line drawings and the original image,
the corresponding frame sequence is produced by image warping. The
resulting sequence has the same expressions as those of the video
sequence. This work encourages the reuse of animated motion by
gathering facial motion sequences into a database. Furthermore, by
using motion sequences of a human face, non-human characters can be
animated realistically or complex characters can be animated by the
help of motion sequences of simpler characters.
BU-CEIS-9622
TITLE: Learning Translation Templates from Bilingual Texts
AUTHOR: H. Altay Güvenir and Ilyas Cicekli
ABSTRACT: This paper proposes a mechanism for learning structural
correspondences between two languages from a corpus of translated
sentence pairs. The proposed mechanism uses analogical reasoning
between two translations. Given a pair of translations, the similar
parts of the sentences in the source language must correspond the
similar parts of the sentences in the target language. Similarly, the
different parts should correspond to the respective parts in the
translated sentences. The correspondences between the similarities,
and also differences are learned in the form of translation
templates. The system is tested on a small training dataset and
produced promising results for further investigation.
BU-CEIS-9623
TITLE: Batch Learning of Disjoint Feature Intervals (M.Sc. Thesis)
AUTHOR: Aynur Akkus
ABSTRACT: This thesis presents several learning algorithms for
multi-concept descriptions in the form of disjoint feature intervals,
called Feature Interval Learning algorithms (FIL). These
algorithms are batch supervised inductive learning algorithms, and use
feature projections of the training instances for the representation
of the classification knowledge induced. These projections can be
generalized into disjoint feature intervals. Therefore, the concept
description learned is a set of disjoint intervals separately for each
feature. The classification of an unseen instance is based on the
weighted majority voting among the local predictions of features. In
order to handle noisy instances, several extensions are developed by
placing weights to intervals rather than features. Empirical
evaluation of the FIL algorithms is presented and compared with some
other similar classification algorithms. Although the FIL algorithms
achieve comparable accuracies with other algorithms, their average
running times are much more less than the others.
This thesis also presents a new adaptation of the well-known
k-NN classification algorithm to the feature projections
approach, called k-NNFP for k-Nearest Neighbor on Feature
Projections, based on a majority voting on individual
classifications made by the projections of the training set on each
feature and compares with the k-NN algorithm on some real-world
and artificial datasets.
BU-CEIS-9624
TITLE: Morphological Disambiguation by Voting Constraints
AUTHOR: Kemal Oflazer and Gökhan Tür
ABSTRACT: We present a constraint-based morphological
disambiguation system in which individual constraints vote on matching
morphological parses, and disambiguation of all the tokens in a
sentence is performed at the very end by selecting parses that
receive the highest votes. This constraint application paradigm
makes the outcome of the disambiguation independent of the rule sequence,
and hence relieves the rule developer from worrying about potentially
conflicting rule sequencing found in other systems. The vote of each
rule is determined by its complexity measured with the kind and number
of features used in the rule.
We have applied our approach in a system for morphological
disambiguation of Turkish, a language with complex
agglutinative word structures, displaying rather different
types of morphological ambiguity not found in languages like
English.
Our results indicate that using about 500
constraint rules and some additional simple statistics,
we can attain a recall of 95-96% and a precision of 94-95%
with about 1.01 parses per token. Our system is
implemented in Prolog and we are currently investigating
an
efficient implementation based on discrimination
networks used in AI production systems
BU-CEIS-9625
TITLE: FAST INFORMATION RETRIEVAL USING COMPRESSED MULTI-FRAGMENTED
SIGNATURE FILES
AUTHOR: Seyit KOCBERBER and Fazli CAN
ABSTRACT:
A new file method, called Compressed Multi-Fragmented Signature
File (C-MFSF), that uses a partial query evaluation strategy with
compressed signature bit slices is presented. The experiments show
that for queries with more than one term, C-MFSF obtains the query
results with fewer disk accesses than the inverted file approach.
For single term queries, which is rare for very large databases,
it requires more disk accesses than the inverted file method while its
response time is still satisfactory. The number of disk accesses is the
same as the number of query terms for queries with more than two terms.
The experiments with a real database of 152,850 records validate the
simulation results and are used to project the response time for very
large databases. For a database of one million records, the projected
response times for single term and multi-term queries are less than 2
and 1.14 seconds, respectively.
BU-CEIS-9626
TITLE:Learning Translation Templates from Examples
AUTHORS: H. Altay Guvenir and Ilyas Cicekli
ABSTRACT:This paper proposes a mechanism for learning lexical level
correspondences between two languages from a set of translated
sentence pairs. The proposed mechanism is based on an analogical
reasoning between two translation examples. Given two translation
examples, the similar parts of the sentences in the source language
must correspond the similar parts of the sentences in the target
language. Similarly, the different parts should correspond to the
respective parts in the translated sentences. The correspondences
between the similarities, and also differences are learned in the form
of translation templates. The approach has been implemented and tested
on a small training dataset and produced promising results for further
investigation.