Bilkent University
Department of Computer Engineering


A Line-based Representation for Matching Words


Ethem Fatih Can
MSc. Student
Computer Engineering Department
Bilkent University

With the increase in the number of documents available in the digital environment, efficient access to these documents becomes crucial. Manual indexing of the documents is costly, however, and can be carried out only in limited amounts. Image processing techniques have been considered as a way to deal with this problem. Although plenty of effort has been spent on optical character recognition (OCR), most existing OCR systems fail to address the challenge of recognizing the characters in historical documents on account of the poor quality of old documents, the high level of noise, and the variety of scripts. More importantly, OCR systems are usually language dependent and not available for all languages. Word spotting techniques have been proposed recently to access historical documents, based on the idea that humans read whole words at a time. In these studies, words rather than characters are considered the basic units. Due to the poor quality of historical documents, the representation and matching of words continue to be challenging problems for word spotting. In this study we address these challenges and propose a simple but effective representation scheme. The images are represented by a set of line descriptors. Two different matching methods that use the line-based representation, WILD (Word Image matching using Line Descriptors) and RECS (Redif Extraction using Contour Segments), are proposed as well. The line-based representation, which does not require any specific pre-processing steps, is applicable to different languages and scripts. The first matching method, WILD, is applied to two tasks: retrieval (i.e., querying a template image on the test bed to retrieve the most similar one) and recognition of the words in English and Ottoman historical documents. The method provides better results than existing word spotting studies on both the retrieval and recognition tasks.
The second method, RECS, differs from word spotting methods in that it removes the requirement of choosing a template image to query. The method automatically extracts the redifs in handwritten Ottoman literary texts. RECS provides promising results in the task of extracting redifs, which motivates further and more advanced studies based on this matching technique.


DATE: 24 December, 2009 @ 11:30