Bilkent University
Department of Computer Engineering


A Template-Independent Content Extraction Approach for News Web Pages


Ahmet Yeniçağ
MSc Student
Computer Engineering Department
Bilkent University

News web pages contain additional elements such as advertisements, hyperlinks, and reader comments. These elements make extracting the news content a challenging task. Current news content extraction (NCE) methods are usually template-based and require regular maintenance, since news providers frequently change their web page templates. There is therefore a need for NCE methods that extract news content accurately without depending on web page templates. In this paper, a template-independent News content EXTraction approach, called N-EXT, is introduced. It first parses a web page into blocks according to its HTML tags. It then examines all blocks to detect the one that contains the major part of the news content. For this purpose, it assigns a weight to each block by considering both its textual size and its similarity to the news title. The block with the highest weight is taken as the news block. Next, N-EXT removes sentences in the news block that are unrelated to the news content, based on each sentence's similarity to the news block. Finally, it examines the remaining blocks to detect the rest of the news content. We demonstrate the template-independence of N-EXT on pages obtained from several news websites. The experimental results show the accuracy of our method.

Keywords: Blocking tags, news block detection (NBD), news content extraction (NCE), NCE test collection, web information aggregators, wrappers.
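
The news block detection step described in the abstract can be illustrated with a minimal Python sketch. It is not the N-EXT implementation: the set of blocking tags, the word-overlap similarity measure, and the way size and title similarity are combined into one weight are all assumptions made for illustration, as are the names BlockSplitter, title_similarity, block_weight, and detect_news_block.

```python
import re
from html.parser import HTMLParser

# Assumed set of blocking tags; the abstract does not list the actual ones.
BLOCKING_TAGS = {"div", "td", "p", "article", "section", "li"}
# Text inside these tags is ignored (the title is passed in separately).
SKIPPED_TAGS = {"script", "style", "title"}


class BlockSplitter(HTMLParser):
    """Cut the visible page text into blocks at blocking-tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse whitespace and keep only non-empty blocks.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCKING_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCKING_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def words(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def title_similarity(block, title):
    """Assumed measure: fraction of title words that occur in the block."""
    title_words = words(title)
    if not title_words:
        return 0.0
    return len(title_words & words(block)) / len(title_words)


def block_weight(block, title, total_chars):
    """Assumed weighting: relative text size boosted by title similarity."""
    relative_size = len(block) / total_chars
    return relative_size * (1.0 + title_similarity(block, title))


def detect_news_block(html, title):
    """Return the highest-weighted block, or None if the page has no text."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    if not splitter.blocks:
        return None
    total_chars = sum(len(block) for block in splitter.blocks)
    return max(splitter.blocks,
               key=lambda block: block_weight(block, title, total_chars))
```

Calling detect_news_block(page_html, news_title) returns the text of the candidate news block. The later steps of the approach (removing unrelated sentences and collecting the rest of the content from other blocks) are not covered by this sketch.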


DATE: Monday, 3 September 2012, 10:40