New Technology of Library and Information Service  2009, Vol. 25 Issue (5): 44-49    DOI: 10.11925/infotech.1003-3513.2009.05.09
A Web Information Extractor Based on the Combination of Ontology and DOM
Liu Jiagang  Chen Shan  He Lingya
(Department of Computer Science, Hunan Institute of Technology, Hengyang 421002, China)
Abstract  

Information extraction based on an Ontology of the information items in a Web page cannot accurately delimit the regions to be extracted. To address this weakness, an improved Web information extractor that combines Ontology and DOM is designed. The paper uses the DOM tree to design an inductive learning algorithm that learns the paths of information items in sample Web pages. With this algorithm, the extraction regions can be delimited accurately, noise in the sample Web pages can be reduced, and the Web pages can be preprocessed. Experiments show that the improved approach increases the precision of information extraction.
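The idea of learning item paths from sample pages can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify. It assumes that each sample page comes with a few hand-labelled item values, represents a path as the sequence of tag names from the root, and generalizes by taking the longest common prefix of the labelled paths. All function names (`text_paths`, `induce_region_path`, `extract_region`) and the sample pages are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Records the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, root first
        self.text_paths = []  # list of (path tuple, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_paths.append((tuple(self.stack), text))


def text_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.text_paths


def induce_region_path(samples):
    """samples: list of (html, labelled item values).
    Returns the longest tag-path prefix shared by all labelled items."""
    item_paths = []
    for html, values in samples:
        wanted = set(values)
        item_paths += [p for p, text in text_paths(html) if text in wanted]
    prefix = []
    for steps in zip(*item_paths):
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return tuple(prefix)


def extract_region(html, region_path):
    """Keeps only text whose path lies under the learned region path."""
    n = len(region_path)
    return [text for p, text in text_paths(html) if p[:n] == region_path]


PAGE = ("<html><body><div><a>Home</a></div><table><tr>"
        "<td>{title}</td><td>{price}</td></tr></table><p>Ad</p></body></html>")
samples = [(PAGE.format(title="Title A", price="12.50"), ["Title A", "12.50"]),
           (PAGE.format(title="Title B", price="8.00"), ["Title B", "8.00"])]
region = induce_region_path(samples)
print(region)
print(extract_region(PAGE.format(title="Title C", price="3.20"), region))
```

In this sketch, text outside the learned path (the navigation link and the advertisement) is discarded before any Ontology-based matching would run, which is the kind of noise reduction and preprocessing the abstract describes. A fuller version would likely need attribute or position information in the path to separate sibling regions that share tag names.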

Key words: Information extraction; Wrapper; Ontology; DOM; Inductive learning
Received: 23 March 2009      Published: 25 May 2009
ZTFLH: TP391.3
Corresponding author: Liu Jiagang     E-mail: superljg@tom.com
About authors: Liu Jiagang, Chen Shan, He Lingya

Cite this article:

Liu Jiagang, Chen Shan, He Lingya. A Web Information Extractor Based on the Combination of Ontology and DOM. New Technology of Library and Information Service, 2009, 25(5): 44-49.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2009.05.09     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2009/V25/I5/44


  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn