New Technology of Library and Information Service  2008, Vol. 24 Issue (12): 48-53    DOI: 10.11925/infotech.1003-3513.2008.12.09
The Study of Topic Information Extraction from Web Pages Based on A New Method of Topic Information Calculation
Lv Juwang (1), Du Yuncheng (1,2), Wang Hongwei (1,2), Shi Shuicai (1,2)
(1) Chinese Information Processing Research Center, Beijing Information Science & Technology University, Beijing 100101, China
(2) Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
Abstract  

To address the insufficient precision of topic information extraction from Web pages, this paper presents a new method for calculating the topic information of Web pages, which divides that information into three forms and applies a different quantization method to each. Building on this idea, the authors combine the Document Object Model with the idea of dividing pages into sections and propose the IB-DOM model. Following a divide-and-conquer strategy, the method first locates the region that contains the topic information and then filters out the irrelevant information. The experimental results show that this approach better resolves the tension between completeness and accuracy in the automatic extraction of topic information from Web pages.
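
The two-step strategy described above (locate the region that holds the topic information, then filter out irrelevant content) can be illustrated with a minimal sketch. This is not the authors' IB-DOM model, and it does not reproduce their three forms of topic information or their quantization methods, which the abstract does not specify. The scoring rule (length of non-link text held directly by a block), the 0.5 link-ratio threshold, and the names `Node`, `TreeBuilder`, `own_text_len`, and `extract_topic_text` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "td", "article", "section", "body"}  # assumed block set
SKIP_TAGS = {"script", "style"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM-like node: tag, parent, and ordered content."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []  # strings and child Nodes, in document order

    def text(self):
        parts = [i if isinstance(i, str) else i.text() for i in self.items]
        return " ".join(p for p in parts if p)

    def link_len(self):
        # Total length of text that sits inside <a> elements.
        if self.tag == "a":
            return len(self.text())
        return sum(i.link_len() for i in self.items if isinstance(i, Node))

    def walk(self):
        yield self
        for i in self.items:
            if isinstance(i, Node):
                yield from i.walk()


class TreeBuilder(HTMLParser):
    """Builds a Node tree; tolerant of unbalanced end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag not in VOID_TAGS:
            node = Node(tag, self.cur)
            self.cur.items.append(node)
            self.cur = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
            return
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        data = data.strip()
        if data and not self.skip:
            self.cur.items.append(data)


def own_text_len(node):
    """Non-link text length in a block, excluding nested block elements."""
    total = 0
    for i in node.items:
        if isinstance(i, str):
            total += len(i)
        elif i.tag not in BLOCK_TAGS and i.tag != "a":
            total += own_text_len(i)
    return total


def extract_topic_text(html, max_link_ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    blocks = [n for n in builder.root.walk() if n.tag in BLOCK_TAGS]
    if not blocks:
        return builder.root.text()
    # Step 1: pick the block that directly holds the most non-link text.
    region = max(blocks, key=own_text_len)
    # Step 2: inside that region, drop children dominated by link text.
    kept = []
    for i in region.items:
        if isinstance(i, str):
            kept.append(i)
            continue
        text = i.text()
        if text and i.link_len() / len(text) <= max_link_ratio:
            kept.append(text)
    return " ".join(kept)


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main">
  <h1>Topic extraction</h1>
  <p>The main article text lives in this block and is fairly long.</p>
  <p><a href="/ad">Click here for offers</a></p>
</div>
<div id="footer">Copyright notice</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_topic_text(SAMPLE))
```

On the sample page, the navigation bar, the footer, and the link-only paragraph are discarded, while the heading and body paragraph of the main block are kept. A real system would need a richer notion of topic relevance than raw text length, which is the gap the paper's calculation method is meant to fill.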

Key words: Topic information of Web page; Information extraction; Information block; Semantic information; IB-DOM tree
Received: 24 September 2008      Published: 25 December 2008
CLC number: TP391

 
Corresponding Authors: Lv Juwang     E-mail: lv.juwang@trs.com.cn
About authors: Lv Juwang, Du Yuncheng, Wang Hongwei, Shi Shuicai

Cite this article:

Lv Juwang, Du Yuncheng, Wang Hongwei, Shi Shuicai. The Study of Topic Information Extraction from Web Pages Based on A New Method of Topic Information Calculation. New Technology of Library and Information Service, 2008, 24(12): 48-53.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2008.12.09     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2008/V24/I12/48

