Bilkent University
Department of Computer Engineering


Statistical Word Segmentation


Hande Adıgüzel
MSc Student
Computer Engineering Department
Bilkent University

Word segmentation is generally considered a key step in processing Ottoman texts. Adapting traditional methods to word segmentation in Ottoman archives is hard because an Ottoman word may consist of more than one sub-word, so these documents contain both inter-word and intra-word gaps. A different approach, statistical word segmentation, which has previously been applied to Chinese texts, uses a statistical method to extract compound words from a large corpus. As a pre-processing step, line segmentation is applied to the documents. Then connected components, each defined as a connected group of black pixels in the document image, are found. The extracted connected components are clustered into groups and tagged according to their clusters. Finally, statistical features that capture the dependency among the connected components of a word, such as mutual information and context dependency, are used to extract words.
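
The last step above can be illustrated with a minimal sketch. It assumes each text line has already been reduced to a sequence of cluster tags, one tag per connected component, and it scores adjacent tags by pointwise mutual information estimated from the corpus itself. The function names pmi_table and segment, the single-letter tags, and the threshold parameter are hypothetical choices made for illustration; the abstract does not specify the exact formulation used in the talk, and context dependency is not modelled here.

```python
import math
from collections import Counter


def pmi_table(lines):
    """Estimate pointwise mutual information for every adjacent tag pair.

    `lines` is a list of tag sequences, one sequence per text line.
    Probabilities are simple relative frequencies over the whole corpus.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tags in lines:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def segment(tags, table, threshold):
    """Group one line of tags into candidate words.

    Adjacent tags stay in the same word when their PMI exceeds
    `threshold`; otherwise a word boundary is placed between them.
    Pairs never seen in the corpus are always split.
    """
    if not tags:
        return []
    words = [[tags[0]]]
    for a, b in zip(tags, tags[1:]):
        if table.get((a, b), float("-inf")) > threshold:
            words[-1].append(b)
        else:
            words.append([b])
    return words


# Toy corpus: the pair (A, B) always occurs together, other pairs rarely.
corpus = [["A", "B", "C"], ["A", "B", "D"], ["C", "A", "B"], ["D", "C"]]
table = pmi_table(corpus)
candidate_words = segment(["A", "B", "C"], table, threshold=2.0)
```

In this sketch a high score between two neighbouring components is read as an intra-word gap and a low score as an inter-word gap. A suitable threshold would have to be chosen empirically for real data.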

We do not use a large pre-tagged corpus for training, nor rule-based approaches that require a pre-defined word list, because both require extensive human involvement from different disciplines, which is too expensive to supply at present. Instead, unsupervised statistical approaches are explored.

Keywords: Historical Manuscripts, Ottoman Documents, Word Segmentation, Line Segmentation.


DATE: 28 November, 2011, Monday @ 15:40