Theses Supervised by Ilyas Cicekli
Hande Dogan
Example Based Machine Translation with Type Associated Translation Examples
M.S. Thesis, January 2007
ABSTRACT: Example-based machine translation is a translation technique that leans on
the machine learning paradigm. This technique is modeled on the following learning process:
a man is given short and simple sentences in language A with their
correspondences in language B; he memorizes these pairs and then becomes able
to translate new sentences via these pairs in the memory. In our system the
translation pairs are kept as translation templates. A translation template is
induced from two given translation examples by replacing the differing parts in
these examples with variables. A variable that replaces a difference consisting of
two differing parts (one from the first example and the other from the second)
is a generalization of those two parts. These variables are supported with
part-of-speech tag information in order to reduce incorrect translations.
After the learning phase, translation is
achieved by finding the appropriate template(s) and replacing the variables.
( pdf copy )
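The template induction step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis implementation: the function name `generalize`, the variable naming scheme `X1, X2, ...`, and the use of Python's `difflib` for token alignment are all choices made here, and the part-of-speech typing of variables is left out.

```python
from difflib import SequenceMatcher


def generalize(example_a, example_b):
    """Replace the differing parts of two sentences with variables.

    Returns the template tokens and, for each variable, the pair of
    differing parts it stands for. (Hypothetical helper, not from the thesis.)
    """
    a_tokens, b_tokens = example_a.split(), example_b.split()
    template, bindings = [], []
    matcher = SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared tokens are copied into the template unchanged.
            template.extend(a_tokens[i1:i2])
        else:
            # A difference becomes one variable generalizing both parts.
            template.append(f"X{len(bindings) + 1}")
            bindings.append((" ".join(a_tokens[i1:i2]),
                             " ".join(b_tokens[j1:j2])))
    return template, bindings


# Two English examples and their Turkish counterparts (illustrative data).
source_template, source_parts = generalize("I read a book", "I read a newspaper")
target_template, target_parts = generalize("kitap okudum", "gazete okudum")
print(" ".join(source_template), "<->", " ".join(target_template))
```

Pairing the source-side and target-side variables (here `book`/`kitap` and `newspaper`/`gazete`) is what turns the two generalized sentences into a translation template; with more than one difference per sentence that pairing needs extra evidence, which this sketch does not attempt.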
Yasin Uzun
Induction Of Logical Relations Based On Specific Generalization Of Strings
M.S. Thesis, January 2007
ABSTRACT: Learning logical relations from examples expressed as first order facts has been studied
extensively in Inductive Logic Programming research. Learning with
positive-only data may cause overgeneralization of examples leading to
inconsistent resulting hypotheses. A learning heuristic inferring specific
generalization of strings based on unique match sequences is shown to be
capable of learning predicates with string arguments. This thesis outlines the
effort made to build an inductive learner based on the idea of specific
generalization of strings, which generalizes given clauses with respect to the
background knowledge using the least general generalization schema. The system is
also extended to generalize predicates having numeric arguments and is shown to be
capable of learning family relations, learning grammars and
predicting mutagenicity using numeric data. ( pdf copy )
Gonenc Ercan
Automated Text Summarization And Keyphrase Extraction
M.S. Thesis, September 2006
ABSTRACT: As the number of electronic documents
increases rapidly, the need for faster techniques to assess the relevance
of documents emerges. A summary can be considered as a concise representation
of the underlying text. To form an ideal summary, a full understanding of the
document is essential. For computers, full understanding is difficult, if not
impossible. Thus, selecting important sentences from the original text and
presenting these sentences as a summary is a common technique in automated text
summarization research.
The lexical cohesion structure of the text can be exploited to determine the importance
of a sentence/phrase. Lexical chains are useful tools to analyze the lexical
cohesion structure in a text. This thesis discusses our research on automated text
summarization and keyphrase extraction using lexical
chains. We investigate the effect of the use of lexical cohesion features in keyphrase extraction, with a supervised machine learning
algorithm. Our summarization algorithm constructs the lexical chains, detects
topics roughly from lexical chains, segments the text with respect to the topics
and selects the most important sentences. Our experiments show
that lexical cohesion based features improve keyphrase
extraction. Our summarization algorithm has achieved good results, compared to
some other lexical cohesion based algorithms. ( pdf copy )
Ozlem Istek
A Link Grammar For Turkish
M.S. Thesis, August 2006
ABSTRACT: Syntactic parsing, or syntactic analysis, is the process of
analyzing an input sequence in order to determine its grammatical structure,
i.e. the formal relationships between the words of a sentence, with respect to
a given grammar. In this thesis, we developed a grammar of the Turkish language
in the link grammar formalism. In the grammar, we used the output of a fully
described morphological analyzer, which is very important for agglutinative
languages like Turkish. The grammar that we developed is lexical in that we
used the lexemes of only some function words, and for the rest of the word
classes we used the morphological feature structures. In addition, we preserved
some of the syntactic roles of the intermediate derived forms of words in
our system. ( pdf copy )
Baris Eker
Turkish Text to Speech System
M.S. Thesis, April 2002
ABSTRACT: Scientists have been interested in producing human speech artificially for more than two centuries. Since the invention of computers, computers have been used to synthesize speech. With the help of this technology, Text To Speech (TTS) systems, which take a text as input and produce speech as output, began to be developed. Some languages, such as English and French, have received most of the attention, while others, such as Turkish, have not been taken into consideration.
This thesis presents a TTS system for Turkish that uses the diphone concatenation method. It takes a text as input and
produces the corresponding speech in Turkish. The output is available in one
male voice only. Since Turkish is a phonetic language, this system
can also be used for other phonetic languages with some minor modifications. If
this system is integrated with a pronunciation unit, it can also be used for
languages that are not phonetic. ( pdf copy )
Goker Canitezer
Generalization of Predicates with String Arguments
M.S. Thesis, January 2002
ABSTRACT: String/sequence generalization is used in many different
areas such as machine learning, example-based machine translation and DNA
sequence alignment. In this thesis, a method is proposed to find the
generalizations of the predicates with string arguments from the given
examples. Learning from examples is a very hard problem in machine
learning, since finding the globally optimal point at which to stop generalization is a
difficult and time-consuming process. All work done so far employs a
heuristic to find the best solution, and this work is no exception. In
this study, some restrictions applied by the SLGG (Specific Least General
Generalization) algorithm, which was developed for use in an example-based
machine translation system, are relaxed to find all possible alignments of
two strings. Moreover, a Euclidean-distance-like scoring mechanism is used to
find the most specific generalizations. Some of the generated templates are
eliminated by four different selection/filtering approaches to get a good solution set.
Finally, the result set is presented as a decision list, which allows
exceptional cases to be handled. ( pdf copy )
Kemal Altintas
Turkish to Crimean Tatar Machine Translation System
M.S. Thesis, July 2001
ABSTRACT: Machine translation has interested people since the invention of computers. Most of the
research has been conducted on western languages such as English and French,
and Turkish and other Turkic languages have been left out of the scene. Machine translation between closely related languages is easier than
between language pairs that are not related to each other. Having many parts
of their grammars and vocabularies in common reduces the amount of effort
needed to develop a translation system between related languages. A translation
system that performs a morphological analysis supported by simpler translation
rules and context dependent bilingual dictionaries would suffice most of the
time. Usually, a semantic analysis is not needed.
This thesis presents a machine translation system from Turkish to Crimean Tatar that
uses finite state techniques for the translation process. By developing a
machine translation system between Turkish and Crimean Tatar, we propose a
sample model for translation between close pairs of languages. The system we
developed takes a Turkish sentence, analyses all the words morphologically,
translates the grammatical and context dependent structures, translates the
root words and finally morphologically generates the Crimean Tatar text. Most
of the time, at least one of the outputs is a true translation of the input
sentence. ( pdf copy )
Atacan Cundoroglu
Error Tolerant
M.S. Thesis, July 2001
ABSTRACT: In NLP (Natural Language Processing), high level grammar formalisms are frequently employed for parsing. Since in practice no formalism can cope with the diversity and flexibility of human languages, such formalisms are used in closed domains, with sub-languages. Even though we believe that in an open world sophisticated analysis is required for extracting meaning from natural language texts, this does not have to be the case for closed domains. Simpler, time-efficient finite state methods can be used in closed domains. With their simplicity and time-efficiency, finite state methods are not only responsive, but also easy to augment with error tolerance, which allows these methods to flexibly parse mildly ungrammatical sentences. In this thesis, we present a parser module which is based on error tolerant finite state recognition and a grammar for parsing transcribed dialogue utterances in a closed Turkish banking domain. Test results on synthetically created erroneous sentences indicate that the proposed system can analyze ungrammatical sentences efficiently and can scale with the growth of the grammar. ( postscript copy )
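The idea of error tolerance in this abstract can be illustrated with a deliberately simplified sketch. The assumptions are loud ones: instead of a finite state recognizer, the grammar is stood in for by a small hypothetical list of accepted token sequences, the utterances are invented English examples rather than Turkish banking dialogue, and tolerance is measured by plain token-level edit distance with a threshold.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, 1):
        current = [i]
        for j, token_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                         # delete token_a
                current[j - 1] + 1,                      # insert token_b
                previous[j - 1] + (token_a != token_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def tolerant_matches(tokens, accepted, threshold=1):
    """Return accepted sequences within `threshold` edits of the input."""
    return [seq for seq in accepted if edit_distance(tokens, seq) <= threshold]


# Hypothetical stand-in for the language accepted by the grammar.
ACCEPTED = [
    ["show", "my", "account", "balance"],
    ["transfer", "money", "to", "savings"],
]

# A mildly ungrammatical utterance (one word missing) is still recognized.
print(tolerant_matches("show account balance".split(), ACCEPTED))
```

A real error tolerant recognizer would search the automaton directly and prune paths once the threshold is exceeded, rather than enumerating accepted strings as done here.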
Umut Topkora
Prefix-Suffix Based Statistical Language Model for Turkish
M.S. Thesis, July 2001
ABSTRACT: As large amounts of online text have become available, concisely representing
quantitative information about language and doing inference on this information
for natural language applications have become an attractive research area.
Statistical language models try to estimate the unknown probability
distribution P(u) that is assumed to have produced
large text corpora of linguistic units u. This probability distribution
estimate is used to improve the performance of many natural language processing
applications including speech recognition (ASR), optical character recognition
(OCR), spelling and grammar correction, machine translation and document
classification. Statistical language modeling has been successfully applied to
English. However, the approaches that perform well for statistical modeling
of English do not carry over to Turkish. Turkish has a productive agglutinative morphology;
that is, it is possible to derive thousands of word forms from a given root word
by adding suffixes. When statistical modeling by word units is used, this
rich vocabulary structure causes data sparseness problems in general and
serious space problems in time-memory critical applications such as speech
recognition.
According to a recent Ph.D. thesis by Hakkani-Tur, using fixed size prefix and suffix parts of words for statistical modeling of Turkish performs better than using whole words for the task of selecting the most likely sequence of words from a list of candidate words emitted by a speech recognizer. Following these successful results, we conducted further research on using smaller units for statistical modeling of Turkish. We used a fixed number of syllables for the prefix and suffix parts. In our experiments we used a small vocabulary of prefixes and suffixes to test the robustness of our approach. We also compared the performance of prefix-suffix language models having 2-word context with word 2-gram models. We found a language model that uses subword units and can perform as well as a large word based language model in 2-word context while being half its size. ( postscript copy )
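A rough sketch of modeling over prefix and suffix units is given below. It is an assumption-laden illustration rather than the model from the thesis: words are split after a fixed number of characters (the thesis uses a fixed number of syllables), the `+` marker on suffix units and the function names are invented here, and probabilities are unsmoothed maximum likelihood estimates.

```python
from collections import Counter


def split_word(word, prefix_len=3):
    """Split a word into a fixed-size prefix and the remaining suffix."""
    return word[:prefix_len], word[prefix_len:]


def unit_bigram_counts(sentences, prefix_len=3):
    """Count bigrams over the stream of prefix and suffix units."""
    counts = Counter()
    for sentence in sentences:
        units = []
        for word in sentence.split():
            prefix, suffix = split_word(word, prefix_len)
            units.append(prefix)
            # Mark suffix units so they stay distinct from prefix units.
            units.append("+" + suffix)
        counts.update(zip(units, units[1:]))
    return counts


def bigram_prob(counts, first, second):
    """Maximum likelihood estimate of P(second | first), without smoothing."""
    total = sum(c for (a, _), c in counts.items() if a == first)
    return counts[(first, second)] / total if total else 0.0


# Tiny illustrative corpus, written without Turkish diacritics.
counts = unit_bigram_counts(["evlerde kaldik", "evlerde yedik"])
print(bigram_prob(counts, "+erde", "kal"))
```

Because many word forms share the same prefix or suffix unit, counts are pooled across forms that a word-based model would treat as unrelated, which is the intuition behind the reduced sparseness the abstract refers to.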
Ayse Pinar Saygin
Turing Test and Conversation
M.S. Thesis, July 1999
ABSTRACT: The Turing Test is one of the most disputed topics in
Artificial Intelligence, Philosophy of Mind and Cognitive Science. It was
proposed 50 years ago as a method to determine whether machines can think or
not. It embodies important philosophical issues, as well as computational ones.
Moreover, because of its characteristics, it requires interdisciplinary
attention. The Turing Test posits that, to be granted intelligence, a computer
should imitate human conversational behavior so well that it is
indistinguishable from a real human being. From this, it follows that
conversation is a crucial concept in its study. Surprisingly, focusing on
conversation in relation to the Turing Test has not been a prevailing approach
in previous research. This thesis first provides a thorough and deep review of
the 50 years of the Turing Test. Philosophical arguments, computational
concerns, and repercussions in other disciplines are all discussed.
Furthermore, this thesis studies the Turing Test as a special kind of
conversation. In doing so, the relationship between existing theories of
conversation and human-computer communication is explored. In particular,
the thesis concentrates on Grice's cooperative principle and conversational maxims.
Viewing the Turing Test as conversation and computers as language users has
significant effects on the way we look at Artificial Intelligence,
and on communication in general. ( postscript copy )
Zeynep Orhan
Confidence Factor Assignment to Translation Templates
M.S. Thesis, September 1998
ABSTRACT: The TTL (Translation Template Learner) algorithm
learns lexical level correspondences between two translation examples by using
analogical reasoning. The sentences used as translation examples have similar
and different parts in the source language which must correspond to the similar
and different parts in the target language. Therefore, these correspondences
are learned as translation templates. The learned translation templates are
used in the translation of other sentences. However, we need to assign
confidence factors to these translation templates to order translation results
with respect to previously assigned confidence factors. This thesis proposes a
method for assigning confidence factors to translation templates learned by the
TTL algorithm. In this process, each template is
assigned a confidence factor according to the statistical information obtained
from training data. Furthermore, some template combinations are also assigned
confidence factors in order to eliminate certain combinations that result in bad
translations. ( pdf copy )
Selman Murat Temizsoy
Design and Implementation of a System for Mapping Text Meaning
Representations to F-Structures of Turkish Sentences
M.S. Thesis, August 1997
ABSTRACT: The interlingua approach to Machine Translation (MT) aims to achieve
the translation task in two independent steps. First, the meanings of source
language sentences are represented in a language-independent artificial
language. Then, sentences of the target language are generated from those
meaning representations. The generation task in this approach is performed in three
major steps among which the second step creates the syntactic structure of a
sentence from its meaning representation and selects the words to be used in
that sentence. This thesis focuses on the design and the implementation of a
prototype system that performs this second task. The meaning representation
used in this work utilizes a hierarchical world representation, an ontology, to denote events and entities, and embeds
semantic and pragmatic issues with special frames. The developed system is
language-independent and it takes information about the target language from
three knowledge resources: a lexicon (word knowledge),
map-rules (the relation between the meaning representation and the
syntactic structure), and the target language's syntactic structure representation.
It performs two major tasks in processing the meaning representation: lexical
selection and mapping the two representations of a sentence. The implemented
system is tested on Turkish using small-sized knowledge resources developed for
Turkish. The output of the system can be fed as input to a tactical generator,
which is developed for Turkish, to produce the final Turkish sentences. ( pdf copy )
Dilek Zeynep Hakkani
Design and Implementation of a Tactical Generator for Turkish, A Free
Constituent Order Language
M.S. Thesis, July 1996 (co-supervised by Kemal Oflazer)
ABSTRACT: This thesis describes a tactical generator for Turkish, a
free constituent order language, in which the order of the constituents may
change according to the information structure of the sentences to be generated.
In the absence of any information regarding the information structure of a
sentence (i.e., topic, focus, background, etc.), the constituents of the
sentence obey a default order, but the order is almost freely changeable,
depending on the constraints of the text flow or discourse. We have used a
recursively structured finite state machine for handling the changes in
constituent order, implemented as a right-linear grammar backbone. Our
implementation environment is the GenKit system,
developed at
Turgay Korkmaz
Turkish Text Generation with Systemic-Functional Grammar
M.S. Thesis, June 1996
ABSTRACT: Natural Language Generation (NLG) is roughly
decomposed into two stages: text planning, and text generation. In the text
planning stage, the semantic description of the text is produced from the
conceptual inputs. Then, the text generation system transforms this semantic
description into an actual text. This thesis focuses on the design and implementation
of a Turkish text generation system rather than text planning. To develop a
text generator, we need a linguistic theory that describes the resources of the
desired natural language, and also a software tool that represents and applies
these linguistic resources in a computational environment. In this thesis, to
meet these requirements, we have used a functional
linguistic theory called Systemic-Functional Grammar (SFG), and the FUF text
generation system as a software tool. The resulting text generation system takes
the semantic description of the text sentence by sentence, and then produces a
morphological description for each lexical constituent of the sentence. The
morphological descriptions are realized as words by a Turkish morphological generator.
Because we concentrate on text generation, we have not considered the
details of the text planning. Hence, we assume that the semantic description of
the text is produced and lexicalized by an application (currently given by
hand). (pdf copy)