New Technology of Library and Information Service  2009, Vol. 25 Issue (5): 44-49    DOI: 10.11925/infotech.1003-3513.2009.05.09
A Web Information Extractor Based on the Combination of Ontology and DOM
Liu Jiagang  Chen Shan  He Lingya
(Department of Computer Science, Hunan Institute of Technology, Hengyang 421002, China)
Abstract  

Information extraction driven by an information-item Ontology of Web pages cannot accurately partition the areas to be extracted. To address this weakness, this paper designs an improved Web information extractor that combines Ontology and DOM. It uses the DOM tree to design an inductive learning algorithm over the paths of information items in sample Web pages. With this algorithm, the extraction areas can be partitioned accurately, noise in the sample Web pages is reduced, and the Web pages are preprocessed. The experiment shows that the improved approach increases the precision of information extraction.
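
The abstract does not give the details of the path-learning algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each sample page comes with the text values of its labelled information items, that a DOM path is a sequence of positional tag steps such as `div[2]`, and that the extraction area is the longest common prefix of the paths to all labelled items. The names `PathCollector`, `text_paths`, `learn_region_path`, and `extract_region` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record the positional tag path from the root to every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []            # steps such as "div[2]" for the open elements
        self.child_counts = [{}]   # per open element (plus document root): tag -> count
        self.found = []            # list of (path tuple, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self.stack.append(f"{tag}[{counts[tag]}]")
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; tolerate unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("[")[0] == tag:
                del self.stack[i:]
                del self.child_counts[i + 1:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def text_paths(html):
    """Return (path, text) pairs for all text nodes in the page."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.found


def learn_region_path(samples):
    """Induce the extraction area from samples: a list of (html, labelled item texts).

    Assumption: the area is the longest common prefix of all labelled item paths.
    """
    paths = [path
             for html, items in samples
             for path, text in text_paths(html)
             if text in items]
    if not paths:
        return ()
    prefix = paths[0]
    for path in paths[1:]:
        n = 0
        while n < min(len(prefix), len(path)) and prefix[n] == path[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def extract_region(html, region_path):
    """Keep only the text nodes that lie under the learned path (noise is dropped)."""
    depth = len(region_path)
    return [text for path, text in text_paths(html) if path[:depth] == region_path]
```

In this sketch, text outside the learned path (navigation bars, footers) is discarded before any Ontology-based matching of information items would take place, which corresponds to the preprocessing step the abstract mentions.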

Key words: Information extraction; Wrapper; Ontology; DOM; Inductive learning
Received: 23 March 2009      Published: 25 May 2009
CLC number: TP391.3
Corresponding author: Liu Jiagang     E-mail: superljg@tom.com
About authors: Liu Jiagang, Chen Shan, He Lingya

Cite this article:

Liu Jiagang, Chen Shan, He Lingya. A Web Information Extractor Based on the Combination of Ontology and DOM. New Technology of Library and Information Service, 2009, 25(5): 44-49.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2009.05.09     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2009/V25/I5/44

