SEMINAR

DEPARTMENT OF COMPUTER ENGINEERING

ABSTRACT

TOPIC-CENTRIC QUERYING OF WEB RESOURCES

by

İsmail Sengör Altıngövde

M.S. in Computer Engineering

Supervisor: Assoc. Prof. Dr. Özgür Ulusoy

As the World Wide Web (WWW) has grown into perhaps the largest source of information available to humankind, locating relevant information on the web in a reasonably short time has become a major challenge. High-quality indices and (sometimes specialized) search engines that employ information retrieval techniques are widely used for keyword-based searches, and a number of web query languages have also been developed, mostly for research purposes. However, most keyword-based approaches are vulnerable to noise on the web and return low-quality results containing many irrelevant documents, whereas web query languages lack the speed or generality needed for practical use.

In this thesis, we make use of metadata (along with some XML-based standards) to characterize web resource domains and to provide sophisticated querying features with high-quality results and a reasonably fast response time. Essentially, we propose a "web information space" metadata model for web information resources, and a query language, SQL-TC (Topic-Centric SQL), to query the model. The web information space model is composed of web-based information resources (XML or HTML documents on the web), expert advice repositories (metadata specified by domain experts for information resources), and personalized information about users (user profiles and preferences, as XML documents).

Expert advice is specified using topics and relationships among topics (called metalinks) in a particular domain of interest, along the lines of the recently proposed topic maps. Experts also attach importance values to the topics and metalinks that they specify, and link them to actual information resources on the web whenever possible, creating a semantic index over the resources. User profiles keep track of user knowledge and navigation history in terms of these topics and their (visited) sources, whereas user preferences declare users' attitudes toward, and confidence in, the choices of particular experts.
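
To make the structure of an expert advice repository concrete, the following is a minimal sketch in Python, not the representation used in the thesis. The class and field names (Topic, Metalink, ExpertAdvice, related) are hypothetical, and the sketch assumes that importance values are numbers in the range [0, 1] and that sources are given as URLs; the abstract fixes neither detail.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Topic:
    """A topic named by an expert (hypothetical structure)."""
    name: str
    importance: float               # expert-assigned; assumed to lie in [0, 1]
    sources: tuple[str, ...] = ()   # URLs of linked web resources, if any


@dataclass(frozen=True)
class Metalink:
    """A typed relationship between two topics (hypothetical structure)."""
    kind: str
    source_topic: str
    target_topic: str
    importance: float               # expert-assigned; assumed to lie in [0, 1]


@dataclass
class ExpertAdvice:
    """All topics and metalinks specified by one expert for one domain."""
    expert: str
    topics: dict[str, Topic] = field(default_factory=dict)
    metalinks: list[Metalink] = field(default_factory=list)

    def add_topic(self, topic: Topic) -> None:
        self.topics[topic.name] = topic

    def add_metalink(self, link: Metalink) -> None:
        # A metalink may only connect topics this expert has already declared.
        for name in (link.source_topic, link.target_topic):
            if name not in self.topics:
                raise KeyError(f"unknown topic: {name}")
        self.metalinks.append(link)

    def related(self, topic_name: str) -> list[Metalink]:
        """Return the metalinks that start at the given topic."""
        return [m for m in self.metalinks if m.source_topic == topic_name]


# Usage with made-up topics and a placeholder URL.
advice = ExpertAdvice(expert="expert-1")
advice.add_topic(Topic("XML", 0.9, ("http://example.org/xml-intro",)))
advice.add_topic(Topic("Topic Maps", 0.7))
advice.add_metalink(Metalink("prerequisite", "XML", "Topic Maps", 0.8))
```

Linking topics to resource URLs is what makes such a repository act as a semantic index: a query can start from a topic and follow its sources or its metalinks.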

The query language SQL-TC uses the metadata provided in expert advice repositories and embedded in information resources, and employs user preferences to further refine the query output. Query output objects/tuples are ranked with respect to the (expert-judged and user-preference-revised) importance values of the requested topics/metalinks, and the query output is limited to the top n ranked objects/tuples, to objects/tuples with importance values above a given threshold, or both. SQL-TC is therefore expected to return highly relevant and semantically related responses to user queries within a short time.
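
The ranking and output-limiting step described above can be sketched as follows. This is an illustration only, not SQL-TC itself: the function name rank_output and its parameters are hypothetical, and the revision rule (multiplying the expert's importance value by the user's confidence in that expert) is an assumption, since the abstract does not state how user preferences revise importance values.

```python
from typing import Optional


def rank_output(
    rows: list[tuple[str, str, float]],
    expert_confidence: dict[str, float],
    top_n: Optional[int] = None,
    threshold: Optional[float] = None,
) -> list[tuple[str, float]]:
    """Rank query output by revised importance and limit its size.

    rows: (object_id, expert, expert_importance) triples.
    expert_confidence: the user's confidence in each expert; experts that
        are not listed keep their importance values unchanged (assumption).
    """
    # Assumed revision rule: scale expert importance by user confidence.
    revised = [
        (obj, importance * expert_confidence.get(expert, 1.0))
        for obj, expert, importance in rows
    ]
    # Keep only objects whose revised importance is above the threshold.
    if threshold is not None:
        revised = [r for r in revised if r[1] > threshold]
    # Rank by revised importance, highest first.
    revised.sort(key=lambda r: r[1], reverse=True)
    # Keep only the top n ranked objects.
    if top_n is not None:
        revised = revised[:top_n]
    return revised


# Usage with made-up data: the user trusts expert e2 more than expert e1.
rows = [("doc-a", "e1", 0.9), ("doc-b", "e2", 0.6), ("doc-c", "e1", 0.8)]
confidence = {"e1": 0.5, "e2": 1.0}
top_two = rank_output(rows, confidence, top_n=2)
```

In this example doc-a has the highest expert-assigned importance, but doc-b ranks first once the user's lower confidence in expert e1 is taken into account.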

Keywords: Metadata, XML, Topic Maps, web data modeling, web querying, semantic indexing, user profile

The seminar will be held on 17.09.2001 at 10:30 in EA 409.