A Line Based Approach to Word Image Matching on Historical Documents


Ethem Fatih Can
Computer Engineering Department
Bilkent University

Studies on historical documents have become more popular as the number of texts available in digital form has increased. Scholars from various disciplines, such as literature, linguistics, cultural and international studies, and history, study such manuscripts. However, indexing or categorizing historical documents usually fails when they are handwritten, since Optical Character Recognition (OCR) either does not work well on historical documents or works only within a limited vocabulary, particularly for texts in Arabic or Ottoman. Therefore, the idea of word image matching, or word spotting, was introduced to replace the recognition task on historical documents. Word image matching is the task of retrieving the word images similar to a given template word image or keyword. Within the context of word spotting, an efficient and effective line-based approach to word image matching is developed. The proposed approach, which is tested on historical documents in English and Ottoman, does not require pre-processing (pruning) steps such as normalization, artifact removal, and skew correction. The experimental results show that the proposed approach achieves high precision and recall scores on the word spotting task. In addition, the approach works independently of the language and the writing style.


DATE: 17 November 2008, Monday @ 15:40