New Technology of Library and Information Service  2007, Vol. 2 Issue (1): 40-43    DOI: 10.11925/infotech.1003-3513.2007.01.10
A General Approach to Extracting Topical Information in HTML Pages
Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai
(Chinese Information Processing Research Center, Beijing Information Science & Technology University, Beijing 100101, China)
This paper studies how to extract topical content from Web pages built on different kinds of templates and introduces a new DOM-based extraction method. The approach transforms an HTML document into a DOM tree, extracts the topical content from the tree, and deletes topic-unrelated content. The result is an HTML document that contains only the topical information.
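
The abstract does not spell out the partitioning or correlativity rules, so the following is only a minimal sketch of the general pipeline (parse HTML into a tree, prune unrelated blocks, serialize what remains). It substitutes a simple link-density heuristic for the paper's own criteria; the names Node, TreeBuilder, text_stats, prune, to_html, extract_topic, and the max_link_ratio threshold are all hypothetical and not taken from the paper.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
NOISE_TAGS = {"script", "style"}            # never topical
BLOCK_TAGS = {"div", "table", "ul", "td", "p"}  # candidate partitions (assumed)


class Node:
    """A very small DOM-like node: children are Node objects or strings."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified tree; attributes are discarded in this sketch."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_stats(node, in_link=False):
    """Return (total text length, text length inside <a>) for a subtree."""
    total = linked = 0
    in_link = in_link or node.tag == "a"
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            linked += len(child) if in_link else 0
        elif child.tag not in NOISE_TAGS:
            t, l = text_stats(child, in_link)
            total += t
            linked += l
    return total, linked


def prune(node, max_link_ratio=0.5):
    """Drop noise tags and blocks that are empty or mostly link text.

    Assumption: link density stands in for the paper's correlativity test.
    """
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in NOISE_TAGS:
            continue
        if child.tag in BLOCK_TAGS:
            total, linked = text_stats(child)
            if total == 0 or linked / total > max_link_ratio:
                continue
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def to_html(node):
    """Serialize the pruned tree back to (attribute-free) HTML."""
    inner = "".join(c if isinstance(c, str) else to_html(c)
                    for c in node.children)
    if node.tag == "root":
        return inner
    if node.tag in VOID_TAGS:
        return f"<{node.tag}>"
    return f"<{node.tag}>{inner}</{node.tag}>"


def extract_topic(html, max_link_ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, max_link_ratio)
    return to_html(builder.root)
```

Calling extract_topic on a page string returns a reduced HTML string. A single link-density threshold is a crude stand-in: it will discard an outer block whose text is dominated by navigation links even if that block also wraps the article, which is the kind of case a real partitioning and correlativity analysis would need to handle.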

Key words: DOM; Information extraction; Partition; Correlativity
Received: 09 October 2006      Published: 25 January 2007


About the authors: Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai

Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai. A General Approach to Extracting Topical Information in HTML Pages. New Technology of Library and Information Service, 2007, 2(1): 40-43.

