The Study of Topic Information Extraction from Web Pages Based on A New Method of Topic Information Calculation
Lv Juwang(1), Du Yuncheng(1,2), Wang Hongwei(1,2), Shi Shuicai(1,2)
(1) Chinese Information Processing Research Center, Beijing Information Science & Technology University, Beijing 100101, China
(2) Beijing TRS Information Technology Co. Ltd, Beijing 100101, China
To address the insufficient precision of topic information extraction from Web pages, this paper presents a new method for calculating the topic information of a Web page: the topic information is divided into three forms, and a different quantization method is applied to each. Building on this idea, the authors combine the Document Object Model (DOM) with the idea of dividing a page into sections and propose the IB-DOM model. Following a divide-and-conquer strategy, the method first locates the region that contains the topic information and then filters out the irrelevant information. The experimental results show that this approach better resolves the conflict between completeness and accuracy in the automatic extraction of topic information from Web pages.
Lv Juwang, Du Yuncheng, Wang Hongwei, Shi Shuicai. The Study of Topic Information Extraction from Web Pages Based on A New Method of Topic Information Calculation[J]. New Technology of Library and Information Service (现代图书情报技术), 2008, 24(12): 48-53.