New Technology of Library and Information Service  2010, Vol. 26 Issue (3): 76-81    DOI: 10.11925/infotech.1003-3513.2010.03.13
The Design and Application of a Web Information Extraction System Based on Web-Harvest
Zhan Jiajia
(Department of Information Management, Sun Yat-Sen University, Guangzhou 510006, China)

This paper first introduces Web-Harvest, an open source information extraction tool, in detail. It then designs a Web information extraction system based on Web-Harvest, extending and improving the tool's functionality. The paper focuses on the system's design concept and workflow, and briefly describes the design of the database tables. Finally, it presents an application of the system.

Key words: Web-Harvest; Web information extraction
Received: 28 January 2010      Published: 25 March 2010


Corresponding author: Zhan Jiajia
About the author: Zhan Jiajia

Cite this article:

Zhan Jiajia. The Design and Application of a Web Information Extraction System Based on Web-Harvest. New Technology of Library and Information Service, 2010, 26(3): 76-81.



