Bilkent University
Department of Computer Engineering


Linking Visual Features with Text for Multimedia Data Mining


Pınar Duygulu
Carnegie Mellon University

School of Computer Science




In the first part of the talk, I will present a new approach to the object recognition problem, motivated by the recent availability of large annotated image collections. This approach treats object recognition as the translation of image regions into words, similar to the translation of text from one language to another. The lexicon for this translation is learned from large annotated image collections, which consist of images associated with text. First, images are segmented into regions; the regions are then clustered in feature space, categorizing them into a finite set of blobs. The correspondences between blobs and words are learned using a method based on the Expectation Maximization (EM) algorithm. Once learned, these correspondences can be used to predict the words corresponding to particular image regions (region naming) or the words associated with whole images (auto-annotation).

In the second part, I will talk about the Informedia Video Understanding Project, which combines speech, image, and natural language understanding to automatically transcribe, segment, and index video for intelligent search and image retrieval. The current library consists of terabytes of broadcast news captured over recent years, together with automatically extracted metadata and indices. I will present the methods I developed recently for detecting commercials automatically and for finding similar video patterns, in order to track similar videos and stories over time and/or across different broadcasting networks. I will also present initial results of my work on automatic naming of people in the news. This work considers face recognition at a large scale and associates face groups with extracted names using the approach presented in the first part of the talk.
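The blob-word lexicon learning described in the first part can be sketched with an IBM Model 1 style EM procedure. The code below is my own minimal illustration under assumed inputs (a toy corpus of blob/word pairs), not the speaker's implementation: each image contributes a set of blob labels (from clustering segmented regions) and a set of annotation words, and EM estimates p(word | blob).

```python
# Minimal sketch of EM-based blob-word lexicon learning (illustrative only;
# blob labels and the toy corpus below are hypothetical).
from collections import defaultdict

def learn_translation_table(corpus, n_iters=20):
    """corpus: list of (blobs, words) pairs. Returns t[blob][word] = p(word | blob)."""
    all_blobs = {b for bs, _ in corpus for b in bs}
    all_words = {w for _, ws in corpus for w in ws}
    # Initialize translation probabilities uniformly.
    t = {b: {w: 1.0 / len(all_words) for w in all_words} for b in all_blobs}
    for _ in range(n_iters):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for bs, ws in corpus:
            for w in ws:
                # E-step: distribute one count for word w across candidate blobs
                # in the same image, in proportion to current probabilities.
                z = sum(t[b][w] for b in bs)
                for b in bs:
                    c = t[b][w] / z
                    count[b][w] += c
                    total[b] += c
        # M-step: renormalize expected counts into probabilities.
        for b in count:
            for w in count[b]:
                t[b][w] = count[b][w] / total[b]
    return t

# Toy annotated collection: each image is (clustered region labels, caption words).
corpus = [
    (["blob1", "blob2"], ["sky", "grass"]),
    (["blob1", "blob3"], ["sky", "tiger"]),
    (["blob2", "blob3"], ["grass", "tiger"]),
]
t = learn_translation_table(corpus)
# Region naming: pick the most probable word for a blob.
best_word = max(t["blob1"], key=t["blob1"].get)
```

In this toy example, EM resolves the co-occurrence ambiguity: "blob1" appears with "sky" in both of its images, so `best_word` converges to "sky". Auto-annotation then amounts to predicting words for all blobs in an image.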


DATE: November 11, 2003, Tuesday @ 16:40