Theses Supervised by Ilyas Cicekli


Hande Dogan
Example Based Machine Translation with Type Associated Translation Examples
M.S. Thesis, January 2007

ABSTRACT: Example based machine translation is a translation technique that leans on the machine learning paradigm. The technique is modeled on the following learning process: a person is given short and simple sentences in language A with their correspondences in language B, memorizes these pairs, and then becomes able to translate new sentences via the pairs in memory. In our system the translation pairs are kept as translation templates. A translation template is induced from two given translation examples by replacing the differing parts in these examples with variables. A variable replacing a difference that consists of two differing parts (one from the first example, and the other from the second example) is a generalization of those two differing parts, and these variables are supported with part-of-speech tag information in order to reduce incorrect translations. After the learning phase, translation is achieved by finding the appropriate template(s) and replacing the variables.  ( pdf copy  )
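
The template induction step described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: it assumes each pair of examples differs in exactly one contiguous span per language, it ignores the part-of-speech tags attached to variables, and the function names, the variable names X1 and Y1, and the English-Turkish example pair are all hypothetical choices made for illustration.

```python
def split_difference(a, b):
    """Split two token lists into (common prefix, diff in a, diff in b, common suffix)."""
    shortest = min(len(a), len(b))
    # Longest common prefix.
    i = 0
    while i < shortest and a[i] == b[i]:
        i += 1
    # Longest common suffix that does not overlap the prefix.
    j = 0
    while j < shortest - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return a[:i], a[i:len(a) - j], b[i:len(b) - j], a[len(a) - j:]


def induce_template(example1, example2):
    """Replace the single differing span on each side with a variable."""
    (src1, tgt1), (src2, tgt2) = example1, example2
    s_pre, s_d1, s_d2, s_suf = split_difference(src1.split(), src2.split())
    t_pre, t_d1, t_d2, t_suf = split_difference(tgt1.split(), tgt2.split())
    template = (
        " ".join(s_pre + ["X1"] + s_suf),
        " ".join(t_pre + ["Y1"] + t_suf),
    )
    # The differing parts become lexical correspondences that can fill X1/Y1.
    bindings = [
        (" ".join(s_d1), " ".join(t_d1)),
        (" ".join(s_d2), " ".join(t_d2)),
    ]
    return template, bindings


template, bindings = induce_template(
    ("I will drink orange juice", "portakal suyu içeceğim"),
    ("I will drink coffee", "kahve içeceğim"),
)
print(template)   # ('I will drink X1', 'Y1 içeceğim')
print(bindings)   # [('orange juice', 'portakal suyu'), ('coffee', 'kahve')]
```

Translating a new sentence would then amount to matching it against the source side of a template and substituting a known binding on the target side.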

 

Yasin Uzun
Induction Of Logical Relations Based On Specific Generalization Of Strings
M.S. Thesis, January 2007

ABSTRACT: Learning logical relations from examples expressed as first order facts has been studied extensively in Inductive Logic Programming research. Learning with positive-only data may cause overgeneralization of examples, leading to inconsistent hypotheses. A learning heuristic inferring specific generalization of strings based on unique match sequences is shown to be capable of learning predicates with string arguments. This thesis outlines the effort made to build an inductive learner based on the idea of specific generalization of strings that generalizes given clauses considering the background knowledge using a least general generalization schema. The system is also extended to generalize predicates having numeric arguments and is shown to be capable of learning concepts such as family relations and grammars, and of predicting mutagenicity using numeric data.  ( pdf copy  )

 

Gonenc Ercan
Automated Text Summarization And Keyphrase Extraction
M.S. Thesis, September 2006

ABSTRACT:  As the number of electronic documents increases rapidly, the need for faster techniques to assess the relevance of documents emerges. A summary can be considered as a concise representation of the underlying text. To form an ideal summary, a full understanding of the document is essential. For computers, full understanding is difficult, if not impossible. Thus, selecting important sentences from the original text and presenting these sentences as a summary is a common technique in automated text summarization research.

 

The lexical cohesion structure of the text can be exploited to determine the importance of a sentence/phrase. Lexical chains are useful tools to analyze the lexical cohesion structure in a text. This thesis discusses our research on automated text summarization and keyphrase extraction using lexical chains. We investigate the effect of the use of lexical cohesion features in keyphrase extraction, with a supervised machine learning algorithm. Our summarization algorithm constructs the lexical chains, detects topics roughly from lexical chains, segments the text with respect to the topics and selects the most important sentences. Our experiments show that lexical cohesion based features improve keyphrase extraction. Our summarization algorithm has achieved good results, compared to some other lexical cohesion based algorithms. ( pdf copy  )

 

Ozlem Istek
A Link Grammar For Turkish
M.S. Thesis, August 2006

ABSTRACT:  Syntactic parsing, or syntactic analysis, is the process of analyzing an input sequence in order to determine its grammatical structure, i.e. the formal relationships between the words of a sentence, with respect to a given grammar. In this thesis, we developed a grammar of the Turkish language in the link grammar formalism. In the grammar, we used the output of a fully described morphological analyzer, which is very important for agglutinative languages like Turkish. The grammar that we developed is lexical in that we used the lexemes of only some function words, and for the rest of the word classes we used the morphological feature structures. In addition, we preserved some of the syntactic roles of the intermediate derived forms of words in our system. ( pdf copy  )

 

Baris Eker
Turkish Text to Speech System
M.S. Thesis, April 2002

ABSTRACT:  Scientists have been interested in producing human speech artificially for more than two centuries. Since the invention of computers, computers have been used to synthesize speech. With the help of this new technology, Text To Speech (TTS) systems that take a text as input and produce speech as output began to be built. Some languages like English and French have received most of the attention, and some languages like Turkish have not been taken into consideration.

 

This thesis presents a TTS system for Turkish that uses the diphone concatenation method. It takes a text as input and produces the corresponding speech in Turkish. The system produces output in a single male voice only. Since Turkish is a phonetic language, this system can also be used for other phonetic languages with some minor modifications. If this system is integrated with a pronunciation unit, it can also be used for languages that are not phonetic. ( pdf copy  )
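
As a rough illustration of diphone concatenation, the Python sketch below turns a word into the sequence of diphone units that would be looked up and joined. It is a simplification under stated assumptions, not the system described in the thesis: every letter is treated as exactly one phoneme, "_" is a hypothetical silence marker, and the `inventory` mapping from diphone names to recorded audio bytes is a made-up stand-in for a real diphone database.

```python
def to_diphones(word, silence="_"):
    """Return the diphone names needed to speak `word`, padded with silence."""
    # Assumption: one letter corresponds to one phoneme.
    phones = [silence] + list(word) + [silence]
    # A diphone spans the transition between two adjacent phonemes.
    return [left + right for left, right in zip(phones, phones[1:])]


def synthesize(word, inventory):
    """Concatenate the stored unit of every diphone, in order."""
    return b"".join(inventory[name] for name in to_diphones(word))


print(to_diphones("ev"))  # ['_e', 'ev', 'v_']
```

A real synthesizer would also smooth the joins between units and adjust duration and pitch, which this sketch leaves out.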

 

Goker Canitezer
Generalization of Predicates with String Arguments
M.S. Thesis, January 2002

ABSTRACT:  String/sequence generalization is used in many different areas such as machine learning, example-based machine translation and DNA sequence alignment. In this thesis, a method is proposed to find the generalizations of predicates with string arguments from given examples. Learning from examples is a very hard problem in machine learning, since finding the globally optimal point at which to stop generalization is a difficult and time-consuming process. All the work done until now employs a heuristic to find the best solution, and this work is no exception. In this study, some restrictions applied by the SLGG (Specific Least General Generalization) algorithm, which was developed for use in an example-based machine translation system, are relaxed to find all possible alignments of two strings. Moreover, a Euclidean-distance-like scoring mechanism is used to find the most specific generalizations. Some of the generated templates are eliminated by four different selection/filtering approaches to get a good solution set. Finally, the result set is presented as a decision list, which allows exceptional cases to be handled. ( pdf copy  )

 

Kemal Altintas
Turkish to Crimean Tatar Machine Translation System
M.S. Thesis,  July 2001

ABSTRACT:  Machine translation has always been interesting to people since the invention of computers. Most of the research has been conducted on western languages such as English and French, and Turkish and the other Turkic languages have been left out of the scene. Machine translation between closely related languages is easier than between language pairs that are not related to each other. Having many parts of their grammars and vocabularies in common reduces the amount of effort needed to develop a translation system between related languages. A translation system that performs a morphological analysis supported by simpler translation rules and context dependent bilingual dictionaries would suffice most of the time. Usually a semantic analysis is not needed.

 

This thesis presents a machine translation system from Turkish to Crimean Tatar that uses finite state techniques for the translation process. By developing a machine translation system between Turkish and Crimean Tatar, we propose a sample model for translation between close pairs of languages. The system we developed takes a Turkish sentence, analyses all the words morphologically, translates the grammatical and context dependent structures, translates the root words and finally morphologically generates the Crimean Tatar text. Most of the time, at least one of the outputs is a true translation of the input sentence. ( pdf copy  )

 

Atacan Cundoroglu
Error Tolerant Finite State Parsing for a Turkish Dialogue System
M.S. Thesis,  July 2001

ABSTRACT:  In NLP (Natural Language Processing), high level grammar formalisms are frequently employed for parsing. Since in practice no formalism can cope with the diversity and the flexibility of human languages, such formalisms are used in closed domains, with sub-languages. Even though we believe that in an open world sophisticated analysis is required for extracting meaning from natural language texts, this does not have to be the case for closed domains. Simpler time-efficient finite state methods can be used in closed domains. With their simplicity and time-efficiency, finite state methods are not only responsive, but also easy to augment with error tolerance, which allows these methods to flexibly parse mildly ungrammatical sentences. In this thesis, we present a parser module which is based on error tolerant finite state recognition and a grammar for parsing transcribed dialogue utterances in a closed Turkish banking domain. Test results on synthetically created erroneous sentences indicate that the proposed system can analyze ungrammatical sentences efficiently and can scale with the growth of the grammar.  ( postscript copy  )
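
The idea of error tolerant finite state recognition can be sketched in Python as a search over a word-level automaton with a budget of allowed edits. This is a simplified illustration, not the parser from the thesis: the automaton encoding (a dict of dicts), the function name, the unit cost for every insertion, deletion and substitution, and the tiny banking-style grammar are all assumptions made for the example.

```python
def accepts_with_errors(transitions, finals, start, tokens, max_errors):
    """Return True if `tokens` is within `max_errors` edits of an accepted sequence."""

    def search(state, pos, errors):
        if errors > max_errors:
            return False  # edit budget exhausted, cut this path off
        if pos == len(tokens) and state in finals:
            return True
        arcs = transitions.get(state, {})
        if pos < len(tokens):
            tok = tokens[pos]
            # Exact match: follow the arc at no cost.
            if tok in arcs and search(arcs[tok], pos + 1, errors):
                return True
            # Extra word in the input: skip it.
            if search(state, pos + 1, errors + 1):
                return True
            # Wrong word in the input: follow a differently labeled arc.
            for label, nxt in arcs.items():
                if label != tok and search(nxt, pos + 1, errors + 1):
                    return True
        # Missing word in the input: advance the automaton only.
        for nxt in arcs.values():
            if search(nxt, pos, errors + 1):
                return True
        return False

    return search(start, 0, 0)


# Hypothetical one-sentence grammar: "hesap bakiyemi istiyorum".
transitions = {0: {"hesap": 1}, 1: {"bakiyemi": 2}, 2: {"istiyorum": 3}}
finals = {3}

print(accepts_with_errors(transitions, finals, 0, ["hesap", "istiyorum"], 0))  # False
print(accepts_with_errors(transitions, finals, 0, ["hesap", "istiyorum"], 1))  # True
```

Because every non-matching move increases the error count, the search terminates even on automata with cycles, and the threshold controls how ungrammatical an utterance may be before it is rejected.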

       

Umut Topkora
Prefix-Suffix Based Statistical Language Model for Turkish
M.S. Thesis,  July 2001

ABSTRACT:  As large amounts of online text have become available, concisely representing quantitative information about language and doing inference on this information for natural language applications have become an attractive research area. Statistical language models try to estimate the unknown probability distribution P(u) that is assumed to have produced large text corpora of linguistic units u. This probability distribution estimate is used to improve the performance of many natural language processing applications including speech recognition (ASR), optical character recognition (OCR), spelling and grammar correction, machine translation and document classification. Statistical language modeling has been successfully applied to English. However, the good performance of approaches to statistical modeling of English does not carry over to Turkish. Turkish has a productive agglutinative morphology; that is, it is possible to derive thousands of word forms from a given root word by adding suffixes. When statistical modeling by word units is used, this rich vocabulary structure causes data sparseness problems in general and serious space problems in time-memory critical applications such as speech recognition.

 

According to a recent Ph.D. thesis by Hakkani-Tur, using fixed size prefix and suffix parts of words for statistical modeling of Turkish performs better than using whole words for the task of selecting the most likely sequence of words from a list of candidate words emitted by a speech recognizer. Following these successful results, we have carried out further research on using smaller units for statistical modeling of Turkish. We have used a fixed number of syllables for the prefix and suffix parts. In our experiments we have used a small vocabulary of prefixes and suffixes to test the robustness of our approach. We also compared the performance of prefix-suffix language models having a 2-word context with word 2-gram models. We have found a language model that uses subword units and can perform as well as a large word based language model in a 2-word context while being half the size.  ( postscript copy  )
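
The following Python sketch shows what counting 2-grams over prefix and suffix units, rather than whole words, might look like. It is only an illustration under stated assumptions: words are split after a fixed number of characters instead of a fixed number of syllables as in the thesis, the "+" marker on suffix units and the sentence boundary symbols are hypothetical conventions, and no smoothing or probability estimation is attempted.

```python
from collections import Counter


def to_units(sentence, prefix_len=4):
    """Split every word into a fixed-length prefix and the remaining suffix."""
    units = []
    for word in sentence.split():
        prefix, suffix = word[:prefix_len], word[prefix_len:]
        units.append(prefix)
        if suffix:
            units.append("+" + suffix)  # "+" marks a suffix unit
    return units


def bigram_counts(corpus, prefix_len=4):
    """Count adjacent unit pairs over a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        units = ["<s>"] + to_units(sentence, prefix_len) + ["</s>"]
        counts.update(zip(units, units[1:]))
    return counts


corpus = ["evlerde oturuyoruz", "evlerimizde oturuyoruz"]
counts = bigram_counts(corpus)
print(to_units("evlerimizde"))   # ['evle', '+rimizde']
print(counts[("<s>", "evle")])   # 2
```

The two different word forms share the prefix unit "evle", so evidence from both sentences is pooled in one count, which is the sparseness advantage that subword units are meant to provide.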

 

Ayse Pinar Saygin
Turing Test and Conversation
M.S. Thesis,  July 1999

ABSTRACT:  The Turing Test is one of the most disputed topics in Artificial Intelligence, Philosophy of Mind and Cognitive Science. It was proposed 50 years ago as a method to determine whether machines can think or not. It embodies important philosophical issues, as well as computational ones. Moreover, because of its characteristics, it requires interdisciplinary attention. The Turing Test posits that, to be granted intelligence, a computer should imitate human conversational behavior so well that it is indistinguishable from a real human being. From this, it follows that conversation is a crucial concept in its study. Surprisingly, focusing on conversation in relation to the Turing Test has not been a prevailing approach in previous research. This thesis first provides a thorough and deep review of the 50 years of the Turing Test. Philosophical arguments, computational concerns, and repercussions in other disciplines are all discussed. Furthermore, this thesis studies the Turing Test as a special kind of conversation. In doing so, the relationship between existing theories of conversation and human-computer communication is explored. In particular, the thesis concentrates on Grice's cooperative principle and conversational maxims. Viewing the Turing Test as conversation and computers as language users has significant effects on the way we look at Artificial Intelligence, and on communication in general. ( postscript copy  )
 

Zeynep Orhan
Confidence Factor Assignment to Translation Templates
M.S. Thesis,  September 1998

ABSTRACT: The TTL (Translation Template Learner) algorithm learns lexical level correspondences between two translation examples by using analogical reasoning. The sentences used as translation examples have similar and different parts in the source language which must correspond to the similar and different parts in the target language. Therefore, these correspondences are learned as translation templates. The learned translation templates are used in the translation of other sentences. However, we need to assign confidence factors to these translation templates in order to rank translation results with respect to previously assigned confidence factors. This thesis proposes a method for assigning confidence factors to translation templates learned by the TTL algorithm. In this process, each template is assigned a confidence factor according to the statistical information obtained from training data. Furthermore, some template combinations are also assigned confidence factors in order to eliminate certain combinations that result in bad translations. ( pdf copy  )
 

Selman Murat Temizsoy
Design and Implementation of a System for Mapping Text Meaning Representations to F-Structures of Turkish Sentences
M.S. Thesis,  August 1997

ABSTRACT: The interlingua approach to Machine Translation (MT) aims to achieve the translation task in two independent steps. First, the meanings of source language sentences are represented in a language-independent artificial language. Then, sentences of the target language are generated from those meaning representations. The generation task in this approach is performed in three major steps, among which the second step creates the syntactic structure of a sentence from its meaning representation and selects the words to be used in that sentence. This thesis focuses on the design and the implementation of a prototype system that performs this second task. The meaning representation used in this work utilizes a hierarchical world representation, an ontology, to denote events and entities, and embeds semantic and pragmatic issues with special frames. The developed system is language-independent and it takes information about the target language from three knowledge resources: lexicon (word knowledge), map-rules (the relation between the meaning representation and the syntactic structure), and the target language's syntactic structure representation. It performs two major tasks in processing the meaning representation: lexical selection and mapping between the two representations of a sentence. The implemented system is tested on Turkish using small-sized knowledge resources developed for Turkish. The output of the system can be fed as input to a tactical generator, which has been developed for Turkish, to produce the final Turkish sentences. ( pdf copy  )
 

Dilek Zeynep Hakkani
Design and Implementation of a Tactical Generator for Turkish, A Free Constituent Order Language
M.S. Thesis,  July 1996   (co-supervised by Kemal Oflazer)

ABSTRACT: This thesis describes a tactical generator for Turkish, a free constituent order language, in which the order of the constituents may change according to the information structure of the sentences to be generated. In the absence of any information regarding the information structure of a sentence (i.e., topic, focus, background, etc.), the constituents of the sentence obey a default order, but the order is almost freely changeable, depending on the constraints of the text flow or discourse. We have used a recursively structured finite state machine for handling the changes in constituent order, implemented as a right-linear grammar backbone. Our implementation environment is the GenKit system, developed at the Carnegie Mellon University Center for Machine Translation. Morphological realization has been implemented using an external morphological analysis/generation component which performs concrete morpheme selection and handles morphographemic processes. ( pdf copy )
 

Turgay Korkmaz
Turkish Text Generation with Systemic-Functional Grammar
M.S. Thesis,  June 1996

ABSTRACT: Natural Language Generation (NLG) is roughly decomposed into two stages: text planning and text generation. In the text planning stage, the semantic description of the text is produced from the conceptual inputs. Then, the text generation system transforms this semantic description into an actual text. This thesis focuses on the design and implementation of a Turkish text generation system rather than text planning. To develop a text generator, we need a linguistic theory that describes the resources of the desired natural language, and also a software tool that represents and applies these linguistic resources in a computational environment. In this thesis, to meet these requirements, we have used a functional linguistic theory called Systemic-Functional Grammar (SFG), and the FUF text generation system as a software tool. The resulting text generation system takes the semantic description of the text sentence by sentence, and then produces a morphological description for each lexical constituent of the sentence. The morphological descriptions are realized as words by a Turkish morphological generator. Because of our concentration on text generation, we have not considered the details of text planning. Hence, we assume that the semantic description of the text is produced and lexicalized by an application (currently given by hand).  (pdf copy)