Information extraction based on an information-item ontology of Web pages cannot accurately partition the extraction areas. To address this weakness, an improved Web information extractor based on ontology and DOM is designed. This paper uses the DOM tree to design an inductive learning algorithm for the paths of information items in sample Web pages. With this algorithm, the extraction areas can be partitioned accurately, noise in the sample Web pages can be reduced, and the Web pages can be preprocessed. Experiments show that the improved approach increases the precision of information extraction.
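The abstract does not spell out the learning algorithm, so the following is only a minimal sketch of how path induction over a DOM-like tree might work, under stated assumptions. It assumes that each text node is identified by its tag path with sibling positions (for example `div[2]`), and that the extraction area is the longest common prefix of the paths of the labeled information items in the sample pages. All names (`TextPathCollector`, `induce_region_path`, `extract_region`) are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextPathCollector(HTMLParser):
    """Record the positional tag path from the root to every text node."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # names of the currently open elements
        self.steps = []       # matching path steps such as "div[2]"
        self.counts = [{}]    # per-level counters of same-tag siblings
        self.paths = []       # collected (path tuple, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.open_tags.append(tag)
        self.steps.append(f"{tag}[{siblings[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.open_tags:
            return
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        while True:
            self.counts.pop()
            self.steps.pop()
            if self.open_tags.pop() == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append((tuple(self.steps), text))


def text_paths(html):
    """Return (path, text) pairs for all non-empty text nodes in the page."""
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def induce_region_path(samples):
    """Induce an extraction area from labeled samples (assumed heuristic).

    samples is a list of (html, labeled_values) pairs. The result is the
    longest common prefix of the paths of all labeled text nodes.
    """
    item_paths = []
    for html, values in samples:
        wanted = set(values)
        item_paths += [p for p, text in text_paths(html) if text in wanted]
    if not item_paths:
        return ()
    prefix = item_paths[0]
    for path in item_paths[1:]:
        n = 0
        while n < min(len(prefix), len(path)) and prefix[n] == path[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def extract_region(html, region_path):
    """Keep only the text nodes that lie inside the induced area."""
    depth = len(region_path)
    return [text for path, text in text_paths(html)
            if path[:depth] == region_path]
```

In this sketch, text outside the induced path (navigation links, for instance) is discarded before any ontology-based matching would run, which is one way to read the noise reduction and preprocessing described above. A real page would likely need a more tolerant path representation than strict sibling positions.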
Liu Jiagang, Chen Shan, He Lingya. A Web Information Extractor Based on the Combination of Ontology and DOM[J]. New Technology of Library and Information Service (现代图书情报技术), 2009, 25(5): 44-49.